
MAINT: use pydantic for parameter validation #514

Draft · wants to merge 32 commits into develop

Conversation

@DerWeh (Contributor) commented Feb 26, 2024

This is a first proof of concept of using pydantic for the validation of parameters. I used Python 3.10 syntax as it is more concise, and pydantic==2.6.2, without putting much thought into it.

Currently, the fit method is extremely messy due to all the checks. Of course, refactoring the checks into a dedicated method would already simplify things.
I think using pydantic might be a more elegant solution. This draft is mostly for discussion of whether this is a good idea or not. I tried to be faithful to the checks that were already implemented.
A disadvantage is that pydantic errors are harder to read. On the plus side, the type annotations are much clearer than having to search through the fit method, and the validation is probably also much easier to maintain.
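
To give a flavor, a minimal sketch of the idea (parameter names and bounds are illustrative only, not the actual implementation):

from typing import Optional

from pydantic import BaseModel, Field


class EBMParams(BaseModel):
    """Illustrative subset of EBM parameters; names and bounds are examples only."""

    max_bins: int = Field(default=1024, ge=2)           # must be an int >= 2
    learning_rate: float = Field(default=0.01, gt=0.0)  # must be a positive float
    objective: Optional[str] = "log_loss"
    random_state: Optional[int] = None


# EBMParams(learning_rate=-1.0)  # raises pydantic.ValidationError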

Currently, I break the private versions of the classifier: attributes were set depending on whether the estimator is private or not. Do you think it would be better to define a superclass with the common arguments and subclasses with the additional arguments?

Another question is how to best proceed with overwriting the default arguments. Currently, there is a lot of redundancy.

If we want to follow through with this idea, I think some sklearn checks should be enabled again using parametrize_with_checks or check_estimator.

@paulbkoch (Collaborator)

Hi @DerWeh -- I like what you're doing here and plan to merge it once it fully replaces the existing checks and passes the tests.

A few thoughts:
I think a good minimum version for us to target would be Python 3.8, since that's still supported. Thankfully, 3.8 supports things like the Literal annotation.

We can add some classes to formalize the DP vs Regular EBM class partition. Someday, I think we might support multi-output EBMs, and one strange type of EBM might be an EBM that has both classification and regression outputs. To create one of the strange beasts, you'd need to directly create an EBMModel object, so I like having the ability to make and use these classes directly. I can't think of any good reason though to mix private and non-private EBMs, so the private vs regular partition can be a first split in the class hierarchy below the EBMModel class.

You probably already saw that I had a test for check_estimator, but disabled it. I think we can probably increase our minimum scikit-learn version to allow for this check now. We didn't pass all the checks since EBMs are more permissive in what they accept than scikit-learn, and I'd like to preserve our user-friendliness in that respect, so I expect we'll want to continue disabling some of the scikit-learn checks.
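
Roughly, I'd imagine the test looking something like this (a sketch; the exact set of skipped checks still needs to be worked out):

from sklearn.utils.estimator_checks import parametrize_with_checks

from interpret.glassbox import (
    ExplainableBoostingClassifier,
    ExplainableBoostingRegressor,
)


@parametrize_with_checks(
    [ExplainableBoostingClassifier(), ExplainableBoostingRegressor()]
)
def test_sklearn_compatible_estimator(estimator, check):
    # Checks that conflict with EBM's permissive input handling would be
    # skipped here before dispatching.
    check(estimator)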

@DerWeh (Contributor, Author) commented Feb 28, 2024

I think a good minimum version that we should target would be python 3.8 since that's still supported. Thankfully 3.8 supports things like the Literal annotation.

Sure enough. I typically orient myself by NEP 29, but it is a good idea to be more conservative (after all, NumPy is very stable nowadays, such that using an older NumPy version is typically sufficient).
I just wanted to avoid the additional boilerplate for type hints in the prototype, in case we reject the idea altogether.

Someday, I think we might support multi-output EBMs, and one strange type of EBM might be an EBM that has both classification and regression outputs.

Out of curiosity: what advantage do you expect? I don't really see the synergy for multi-output in EBMs compared to deep networks (or deep trees). Can we expect an advantage during the boosting phase?
But this is mostly an academic question. In practice, I am currently mostly interested in more convenient multi-class support. Interactions and post-processing seem to still be lacking.

You probably already saw that I had a test for check_estimator, but disabled it.

Yes, I saw it. The PR for the non-private EBMs is ready.

As soon as it is merged (and when I find time), I can continue working on pydantic.

@paulbkoch (Collaborator) commented Feb 28, 2024

Someday, I think we might support multi-output EBMs, and one strange type of EBM might be an EBM that has both classification and regression outputs.

Out of curiosity: what do advantage do you expect? I don't really see the synergy for multi-output in EBMs compared to deep networks (or deep trees). Can we expect an advantage during the boosting phase? But this is mostly an academic question. In practice, I am currently mostly interested in more convenient multi-class support. Interactions and post-processing seem to still be lacking.

I think you're right that there will be very little technical benefit. Mostly, I think of it as potentially a nice packaging solution for grouping related predictions, although I might reject the idea if it adds anything more than a little complexity. EBMs do have some benefits in terms of the compactness of the model representation. For example, SoftAtHome, which wrote Ebm2Onnx (https://github.com/interpretml/ebm2onnx), uses EBMs inside their routers, and I also saw a paper where some company was evaluating EBMs to be used inside pacemakers (!!!). Sharing the feature definitions and bin cuts might be nice in those applications, both for the compactness of the model, because it could lower the computational requirements of predictions, and also to speed up predictions.

Probably more likely than boosting multi-output models together would be constructing two separate EBMs and then merging them together as a post-processing step, similarly to how merge_ebms works today. But who knows, maybe someone really will want some super weird objective that includes both classification and regression. I'd like that, even if just for the entertainment value of seeing someone use an EBM this way.
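
For reference, merging separate models already works along these lines today (a rough sketch using merge_ebms; the data split is illustrative):

from interpret.glassbox import ExplainableBoostingClassifier, merge_ebms
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

# Train two EBMs on different halves of the data, then combine them.
ebm1 = ExplainableBoostingClassifier().fit(X[:250], y[:250])
ebm2 = ExplainableBoostingClassifier().fit(X[250:], y[250:])
merged = merge_ebms([ebm1, ebm2])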

@paulbkoch (Collaborator)

But this is mostly an academic question. In practice, I am currently mostly interested in more convenient multi-class support. Interactions and post-processing seem to still be lacking.

We don't allow multi-class interactions in the default API entirely because of the difficulties in visualizing them. You can construct multi-class EBMs with interactions today if you use the measure_interactions function manually. Here's a slightly related example showing how to use it:

https://interpret.ml/docs/python/examples/custom-interactions.html
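
In rough strokes, the manual route from that example looks like this (a sketch only; consult the linked page for the exact workflow and the return format of measure_interactions):

from interpret.glassbox import ExplainableBoostingClassifier
from interpret.utils import measure_interactions
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_informative=6, n_classes=3, random_state=0)

# Stage 1: a mains-only multi-class EBM.
ebm_mains = ExplainableBoostingClassifier(interactions=0).fit(X, y)

# Stage 2: rank candidate pairs by interaction strength, then boost the
# strongest pairs on top of the mains model via init_score.
ranked = measure_interactions(X, y, init_score=ebm_mains)  # [(pair, strength), ...]
top_pairs = [pair for pair, _strength in ranked[:5]]
ebm_pairs = ExplainableBoostingClassifier(exclude="mains", interactions=top_pairs)
ebm_pairs.fit(X, y, init_score=ebm_mains)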

@paulbkoch (Collaborator)

Hi @DerWeh -- I've been thinking through this question the last few days and here's what I came up with in terms of a hierarchy:

EBMBase -> abstract base class for all EBM models
    EBMModel(EBMBase) -> instantiable class for normal EBM classifiers/regressors
        ExplainableBoostingClassifier(EBMModel)
        ExplainableBoostingRegressor(EBMModel)
    DPEBMModel(EBMBase) -> instantiable class for differential privacy classifiers/regressors
        DPExplainableBoostingClassifier(DPEBMModel)
        DPExplainableBoostingRegressor(DPEBMModel)

Thoughts?

@DerWeh (Contributor, Author) commented Mar 4, 2024

Yes, the hierarchy sounds very reasonable. Though I am not quite sure whether making EBMBase an abstract base class has any benefit. See the discussion: https://stackoverflow.com/questions/3570796/why-use-abstract-base-classes-in-python

But it doesn't really seem very important to me, so I am fine with anything.


However, now that EBMs have the checks for scikit-learn compatibility, I ran into a bit of a problem: clone doesn't work with parameter validation anymore. clone performs explicit is checks (old_parameter is new_parameter), which fail due to pydantic...
pydantic doesn't offer the option to defer the validation; see the rejected request pydantic/pydantic#559. I knew that scikit-learn urges you to put the validation into the fit method (https://scikit-learn.org/stable/developers/develop.html#instantiation), but I didn't think they were this pedantic. Currently, the only option I see to get through with this PR is providing our own custom implementation of clone, via a custom _clone_parametrized, but this seems somewhat ugly...
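
To make the failure concrete, a toy reproduction (not interpret code):

from sklearn.base import BaseEstimator, clone


class ValidatingEstimator(BaseEstimator):
    def __init__(self, exclude=None):
        # Validation that coerces the input stores a *different* object.
        self.exclude = list(exclude) if exclude is not None else None


est = ValidatingEstimator(exclude=("age", "income"))
clone(est)  # RuntimeError: the coerced parameter fails clone's `is` identity check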

What is your opinion @paulbkoch?

@paulbkoch (Collaborator)

Yes, the hierarchy sounds very reasonable. Though I am not quite sure whether making EBMBase an abstract base class has any benefit. See the discussion: https://stackoverflow.com/questions/3570796/why-use-abstract-base-classes-in-python

But it doesn't really seem very important to me, so I am fine with anything.

You're probably right that it has very little technical benefit in python. I was thinking of it mainly as a form of documentation to indicate that the EBMBase class should not be instantiated directly and then just using duck typing as we currently are. I haven't written as much python as you have (seemingly) and have never made an abstract class in python, but I'm assuming adding the abstract part is fairly lightweight.
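
From what I've read, the abstract part would be about as small as this (a minimal sketch, assuming we mark at least one method abstract):

from abc import ABC, abstractmethod


class EBMBase(ABC):
    @abstractmethod
    def fit(self, X, y):
        """Subclasses must provide fit."""


# EBMBase() now raises TypeError: can't instantiate abstract class EBMBase.
# Subclasses that implement fit() can be instantiated as usual.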

However, now that EBMs have the checks for scikit-learn compatibility, I ran into a bit of a problem: clone doesn't work with parameter validation anymore. clone performs explicit is checks (old_parameter is new_parameter), which fail due to pydantic... pydantic doesn't offer the option to defer the validation; see the rejected request pydantic/pydantic#559. I knew that scikit-learn urges you to put the validation into the fit method (https://scikit-learn.org/stable/developers/develop.html#instantiation), but I didn't think they were this pedantic. Currently, the only option I see to get through with this PR is providing our own custom implementation of clone, via a custom _clone_parametrized, but this seems somewhat ugly...

What is your opinion @paulbkoch?

Yeah, that does seem a bit ugly. I didn't realize pydantic was unable to use deferred checks. Even if we get the clone function working though, it's not clear if the scikit-learn requirement to defer checks will break anything else now or in the future.

Maybe we should instead move back to option #2 @DerWeh where we move the checks into a separate _validate function? That would at least move them out of the way so that someone reading the fit function would only need to see them if they were specifically interested in the checks. And we'd get to keep the more descriptive error messages too.

@DerWeh (Contributor, Author) commented Mar 5, 2024

You're probably right that it has very little technical benefit in python. I was thinking of it mainly as a form of documentation to indicate that the EBMBase class should not be instantiated directly and then just using duck typing as we currently are. I haven't written as much python as you have (seemingly) and have never made an abstract class in python, but I'm assuming adding the abstract part is fairly lightweight.

I agree, it should be lightweight, so I never benchmarked it (or in fact even used it). For documentation purposes, the most important part is probably not to make the class public.

Maybe we should instead move back to option #2 @DerWeh where we move the checks into a separate _validate function? That would at least move them out of the way so that someone reading the fit function would only need to see them if they were specifically interested in the checks. And we'd get to keep the more descriptive error messages too.

Sure, having a separate _validate function is already a good first step. A question to consider is whether it is still a good idea to separate the EBMs and the DP versions (using your suggested class hierarchy), and whether we use the vanilla dataclass offered by Python. It doesn't do any validation, but it avoids the boilerplate of writing the __init__ function with all the self.attribute = attribute statements.

Another question is whether I should keep the 'Yoda conditions' that are currently used for range checks (example: passthrough < 0.0 or 1.0 < passthrough). Is this the code style you use? I don't really know why, but my brain doesn't seem to like these statements, and I always have to mentally convert them to variable < constant. Of course, this is a very specific example where we could use Python's chained comparisons: not (0.0 <= passthrough <= 1.0) (for some reason this is easy to read for me, but the single 0.0 <= passthrough takes time to digest).

Anyway, I'll play around a bit more in case I find a reasonable (elegant) way to use pydantic for validation. If I don't find a way, I'll at least do the refactor.


For documentation purposes: I just gave it a try out of curiosity, and it is __sklearn_clone__ which we have to override. The documentation indicates, however, that scikit-learn>=1.3 is required.

# Requires `import copy` and `from sklearn.base import clone` at module scope.
def __sklearn_clone__(self, *, safe=True):
    """Customized implementation of clone. See :func:`sklearn.base.clone` for details."""
    klass = self.__class__
    new_object_params = self.get_params(deep=False)
    for name, param in new_object_params.items():
        new_object_params[name] = clone(param, safe=False)

    new_object = klass(**new_object_params)
    try:
        new_object._metadata_request = copy.deepcopy(self._metadata_request)
    except AttributeError:
        pass

    # _sklearn_output_config is used by `set_output` to configure the output
    # container of an estimator.
    if hasattr(self, "_sklearn_output_config"):
        new_object._sklearn_output_config = copy.deepcopy(self._sklearn_output_config)
    return new_object

@paulbkoch (Collaborator)

Sure, having a separate _validate function is already a good first step. A question to consider is whether it is still a good idea to separate the EBMs and the DP versions (using your suggested class hierarchy), and whether we use the vanilla dataclass offered by Python. It doesn't do any validation, but it avoids the boilerplate of writing the __init__ function with all the self.attribute = attribute statements.

Avoiding boilerplate through dataclass sounds like a good idea. I guess this implies having the class hierarchy since we conditionally assign instance attributes based on whether the class is differentially private or not.

Another question is whether I should keep the 'Yoda conditions' that are currently used for range checks (example: passthrough < 0.0 or 1.0 < passthrough). Is this the code style you use? I don't really know why, but my brain doesn't seem to like these statements, and I always have to mentally convert them to variable < constant. Of course, this is a very specific example where we could use Python's chained comparisons: not (0.0 <= passthrough <= 1.0) (for some reason this is easy to read for me, but the single 0.0 <= passthrough takes time to digest).

If you want to change the ordering, go for it! I do use Yoda-conditionals in C++ to combat assignment bugs since "if(a = 5)" compiles, but "if(5 = a)" does not, but I don't care about the order in python since assignment isn't legal inside a conditional in python.

Anyway, I'll play around a bit more in case I find a reasonable (elegant) way to use pydantic for validation. If I don't find a way, I'll at least do the refactor.

Sounds good. Thanks again for investigating all these options!

DerWeh added 11 commits March 7, 2024 01:35
We use a regular Python dataclass in combination with deferred parameter
validation using pydantic's `TypeAdapter`. This is necessary to be
compatible with scikit-learn's `clone` function.

In case we are not interested in an accurate result, we can drastically
reduce the parameters to cut the fit time.
On my old laptop, this reduces test time from 8 minutes to 1 minute.

@DerWeh (Contributor, Author) commented Mar 7, 2024

OK, the parameter validation seems to be working now. I managed to find a way based on pydantic's TypeAdapter.

The DP versions are currently broken, as I focused only on the non-private versions.

I am currently looking into breaking up the fit method a bit, to look for a way to avoid as much code duplication as possible when splitting between DP and non-private EBMs.
But this will probably take quite some time. The fit method is quite the monolith, so it takes a lot of time to understand what's going on.
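
In rough terms, the approach looks like this (field names illustrative; not the actual PR code):

from dataclasses import dataclass, fields
from typing import Optional

from pydantic import TypeAdapter


@dataclass
class EBMModel:
    # A plain dataclass: __init__ only stores the parameters unchanged, so
    # scikit-learn clone()'s identity checks on get_params() still hold.
    max_bins: int = 1024
    learning_rate: float = 0.01
    random_state: Optional[int] = None

    def fit(self, X, y):
        # Validation is deferred to fit, as scikit-learn demands: TypeAdapter
        # re-checks the stored attributes against the type annotations and
        # raises a ValidationError on bad values.
        params = {f.name: getattr(self, f.name) for f in fields(self)}
        TypeAdapter(type(self)).validate_python(params)
        ...  # actual fitting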


noise_scale_binning_: float
noise_scale_boosting_: float
objective: Literal["log_loss"] = "log_loss"
Collaborator review comment:

objective can be any string, as above.

@@ -3259,103 +2932,17 @@ class DPExplainableBoostingRegressor(EBMModel, RegressorMixin, ExplainerMixin):
noise_scale_boosting\_ : float
The noise scale during boosting.
"""
objective: Literal["rmse"] = "rmse"
Collaborator review comment:

objective can be any string, as above.

@DerWeh (Contributor, Author) commented Mar 10, 2024

@paulbkoch I'll answer your comments in detail as soon as I find the time. For now, I can just give a more general remark:
Currently, I kept the EBMModel and introduced a new DPEBMModel subclass, which basically pins the attributes that are unnecessary for DP. I fully agree that a cleaner approach is to have the class structure you mentioned: a common base class with a non-private and a DP subclass.
The issue is that at the moment everything is quite mixed in the fit method, such that it's not particularly easy to separate the parts without error-prone code duplication.

The big question is whether we use such a monkey-patch approach as an intermediate step, followed by additional clean-up PRs, or whether we go for a total revamp from the start. Currently, I went for the first option, as the PR is already a huge change and I am an outsider to the project. I don't really want to mess things up, and incremental changes might be easier to merge as they are more traceable.
But if you want to get it 'right' from the start, I can also try to split non-private and DP right in this PR.

@paulbkoch (Collaborator) commented Mar 10, 2024

Thanks for the background @DerWeh. We can split it up into two PRs if that's easier. The drawback I see is that it temporarily puts the project in a state where we shouldn't put out a release, but I think that's fine in the short term, since I get the sense this is a hole we can plug quickly. It shouldn't break any existing code if anyone clones it, so that's OK too.

@paulbkoch (Collaborator)

Hi @DerWeh -- I just noticed something a bit concerning. I downloaded the documentation produced from the PR and the documentation for the ExplainableBoostingClassifier and ExplainableBoostingRegressor is missing. Not sure what the problem is yet.

The docs are available to download here:
https://dev.azure.com/ms/interpret/_build/results?buildId=552050&view=artifacts&pathAsName=false&type=publishedArtifacts

We use Jupyter Book to make our documentation which sits on top of Sphinx. The ExplainableBoostingClassifier is documented here:
https://github.com/interpretml/interpret/blob/develop/docs/interpret/python/api/ExplainableBoostingClassifier.ipynb

I'll have a look.

@paulbkoch (Collaborator) commented Mar 10, 2024

On the documentation, I think we might just need to add the sphinx-pydantic extension:

https://sphinx-pydantic.readthedocs.io/en/latest/

here:

extra_extensions:
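
i.e., something along these lines in the Jupyter Book _config.yml (extension name taken from the sphinx-pydantic docs; unverified):

sphinx:
  extra_extensions:
    - sphinx-pydantic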

@paulbkoch (Collaborator)

It would be great if you could help me out with the last failing test. The test tests/test_explainers.py::test_spec_synthetic fails for explainer_class = <class 'interpret.greybox._shaptree.ShapTree'>. I don't see how this relates to my changes in the EBM. Can you give me a hint?

Hi @DerWeh -- I've resolved the ShapTree test that was failing on the develop branch. It was due to a change in the SHAP API.

@paulbkoch marked this pull request as ready for review March 13, 2024 07:52
Comment on lines -3296 to -3323
feature_names: Optional[Sequence[Union[None, str]]] = None,
feature_types: Optional[
Sequence[Union[None, str, Sequence[str], Sequence[float]]]
] = None,
# Preprocessor
max_bins: int = 32,
# Stages
exclude: Optional[Sequence[Union[int, str, Sequence[Union[int, str]]]]] = None,
# Ensemble
validation_size: Optional[Union[int, float]] = 0,
outer_bags: int = 1,
# Boosting
learning_rate: float = 0.01,
max_rounds: Optional[int] = 300,
# Trees
max_leaves: int = 3,
# Overall
n_jobs: Optional[int] = -2,
random_state: Optional[int] = None,
# Differential Privacy
epsilon: float = 1.0,
delta: float = 1e-5,
composition: str = "gdp",
bin_budget_frac: float = 0.1,
privacy_bounds: Optional[
Union[np.ndarray, Mapping[Union[int, str], Tuple[float, float]]]
] = None,
privacy_target_min: Optional[float] = None,
Collaborator review comment:

Is there any way to preserve the parameter ordering of the __init__ functions if some of the attributes are defined in the EBMBase, EBMModel, and ExplainableBoostingClassifier/ExplainableBoostingRegressor classes while still using the dataclass decorators? It looks by default like the parameters will be grouped by inheritance ordering, so all the EBMBase attributes would come first, then all the EBMModel attributes, etc..
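
A small demonstration of the default behavior in question (toy classes):

from dataclasses import dataclass, fields


@dataclass
class Base:
    a: int = 1


@dataclass
class Derived(Base):
    b: int = 2


# Fields (and hence the generated __init__ parameters) are grouped by
# inheritance: base-class fields come first.
print([f.name for f in fields(Derived)])  # ['a', 'b']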

@paulbkoch marked this pull request as draft March 13, 2024 08:51
@DerWeh (Contributor, Author) commented Mar 13, 2024

Hi @DerWeh -- I just noticed something a bit concerning. I downloaded the documentation produced from the PR and the documentation for the ExplainableBoostingClassifier and ExplainableBoostingRegressor is missing. Not sure what the problem is yet.

The docs are available to download here: https://dev.azure.com/ms/interpret/_build/results?buildId=552050&view=artifacts&pathAsName=false&type=publishedArtifacts

We use Jupyter Book to make our documentation which sits on top of Sphinx. The ExplainableBoostingClassifier is documented here: https://github.com/interpretml/interpret/blob/develop/docs/interpret/python/api/ExplainableBoostingClassifier.ipynb

I'll have a look.

Sorry, I've never really used notebooks to generate documentation, and I am struggling a bit with the Azure pipeline.
However, when I tried to build the documentation locally, the ExplainableBoostingClassifier documentation was empty. Is this the issue you're seeing? Locally, the problem is that requirements.txt pins typing_extensions==4.5.0, which is not compatible with pydantic, causing the documentation build to fail. If I update typing_extensions, everything seems to be fine.

Where does the pinning come from? Can I just update the package?

@paulbkoch (Collaborator)

Sorry, I've never really used notebooks to generate documentation, and I am struggling a bit with the Azure pipeline. However, when I tried to build the documentation locally, the ExplainableBoostingClassifier documentation was empty. Is this the issue you're seeing? Locally, the problem is that requirements.txt pins typing_extensions==4.5.0, which is not compatible with pydantic, causing the documentation build to fail. If I update typing_extensions, everything seems to be fine.

Where does the pinning come from? Can I just update the package?

Yes, just update the pinning if necessary. I think this is pinned because Jupyter Book changes format frequently and pinning preserves documentation consistency, but we'll need to update these periodically.

@paulbkoch (Collaborator)

If we dropped Python 3.8 support, we could eliminate the typing_extensions dependency, right? Seems like a good tradeoff.

@paulbkoch (Collaborator)

I've updated the develop branch pipelines to use Python 3.9 for building the docs (and other build tasks).

@DerWeh (Contributor, Author) commented Mar 20, 2024

Sorry for the wait; I didn't manage to spend much time on it last week.

Indeed, we only use Annotated from typing_extensions, which is available in the standard library's typing module since Python 3.9.

However, the import is hard-coded, so I expect that typing_extensions needs to be installed anyway. Surprisingly, the docs build locally on my machine using Python 3.9 with the requirements.txt file.

I just googled a little: if you want to keep supporting Python 3.8 while keeping the dependencies as slim as possible, there is an option to make dependencies specific to a Python version:

setup(
    ...,
    install_requires=[
        "typing_extensions;python_version<'3.9'",
    ],
)

Of course, this would complicate the source code:

import sys
...
if sys.version_info < (3, 9):
    from typing_extensions import Annotated
else:
    from typing import Annotated

Your thoughts @paulbkoch?


I hope that I can resolve the conflicts with interpret 0.6.0 the next days and continue on this PR.

@paulbkoch (Collaborator)

Hi @DerWeh -- I'm not too fussy on this, but of the options given, I'd have a preference towards keeping things simple and dropping support for 3.8. Anyone who needs Python 3.8 can just stick with interpret 0.6.0 until support for 3.8 officially ends in a few months.
