
FEA Callbacks base infrastructure + progress bars #27663

Open · wants to merge 57 commits into base: callbacks

Conversation

@jeremiedbb (Member) commented Oct 25, 2023:

Extracted from #22000

This PR implements a smaller portion of #22000, with only the base infrastructure for callbacks and the implementation of a single callback (progress bars). It targets the callbacks branch, and the goal is to have the full callback implementation done in several smaller PRs merged into this branch before we merge it into main.

This PR proposes significant changes compared to #22000 that should, imo, greatly improve its chances of getting merged 😄
The main improvement is that it no longer requires writing to disk, but instead relies on multiprocessing.Manager objects and queues. It simplifies the code a lot.
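To make the mechanism concrete, here is a minimal standalone sketch of the manager/queue idea (an illustration only, not the PR's actual code): worker processes push progress events onto a shared queue, and the main process drains it to update a display, with nothing written to disk.

```python
from multiprocessing import Manager, Pool


def work(args):
    task_id, queue = args
    # ... the actual computation for this task would run here ...
    queue.put(task_id)  # report completion back to the main process
    return task_id


if __name__ == "__main__":
    with Manager() as manager:
        # A manager Queue proxy can be shared with pool workers,
        # unlike a plain multiprocessing.Queue.
        queue = manager.Queue()
        tasks = [(i, queue) for i in range(8)]
        with Pool(2) as pool:
            async_result = pool.map_async(work, tasks)
            for done in range(1, len(tasks) + 1):
                queue.get()  # blocks until some worker reports progress
                print(f"progress: {done}/{len(tasks)}")
            async_result.wait()
```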

In #22000 I adapted some estimators to work with the callbacks, which I did not include here to keep the PR as light as possible. You can however experiment with the callbacks using the estimators that I wrote for testing purposes:

```python
from sklearn.callback import ProgressBar
from sklearn.callback.tests._utils import Estimator, MetaEstimator

est = Estimator()
meta_est = MetaEstimator(est, n_jobs=2)  # fits the inner estimator in 2 parallel jobs
meta_est._set_callbacks(ProgressBar())
meta_est.fit(None, None)  # X and y are unused by these testing estimators
```

You can add sleep calls in these testing estimators to simulate how the progress bars behave when the computations take longer.

The plan is then to have several follow-up PRs to implement the other callbacks, adapt a few estimators to work with the callbacks, add some documentation and examples, add more tests...

```python
    estimator=self,
    node=root.children[i],
):
    break
```
Member:

Is there a way to cover this one?

Member Author:

It will be covered by the EarlyStopping callback.

@jeremiedbb (Member Author):

Hi @rth, I think this is getting ready for a second review :)
This PR introduces the base infrastructure for callbacks and a first callback for progress bars. Next steps are detailed in #27676

@glemaitre (Member) left a comment:

LGTM on my side. @rth, please ignore all the lock file changes; they do not require any attention.

@jondo mentioned this pull request on Nov 29, 2023.
@rth (Member) left a comment:

Very sorry for the response time, @jeremiedbb. Hopefully I have fixed my email filters so that I see mentions from scikit-learn.

I read it through once, and the main thing I don't understand is how you compute the levels dict. I would have expected build_computation_tree to build the computation tree, but it actually takes levels as a parameter, which is already some form of that tree.

Say you have a ColumnTransformer with 2 pipelines inside, the first with 2 steps and the other with 3 steps. How do you compute the computation tree in that case?
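For reference, judging only from the levels docstring quoted further down this thread (the descr / max_iter keys), a levels argument for a simple estimator might look like this hypothetical sketch. A flat list like this describes one uniform nesting, which is what makes the branching ColumnTransformer case unclear:

```python
# Hypothetical values inferred from the docstring quoted later in this
# thread; not taken from the actual PR code.
levels = [
    {"descr": "fit", "max_iter": 10},     # outer loop with 10 children
    {"descr": "iter", "max_iter": None},  # leaf level
]
```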

sklearn/base.py (outdated):

```diff
@@ -115,6 +116,10 @@ def _clone_parametrized(estimator, *, safe=True):

     params_set = new_object.get_params(deep=False)

+    # attach callbacks to the new estimator
+    if hasattr(estimator, "_callbacks"):
```
Member:

BTW, this might break something for https://github.com/microsoft/FLAML/blob/0415638dd1e1d3149fb17fb8760520af975d16f6/flaml/automl/model.py#L1587, which also adds this attribute to scikit-learn's base estimator in their library. But there are no reserved private namespaces, so they could probably adapt if it does.

Member:

This can also break quite easily if the callback object keeps references to attributes of the old estimator. Why aren't we creating a copy here?

Member Author:

> BTW, this might break something for https://github.com/microsoft/FLAML/blob/0415638dd1e1d3149fb17fb8760520af975d16f6/flaml/automl/model.py#L1587 which also adds this attribute to scikit-learn's base estimator in their library

I can change the name to _sk_callbacks or any derivative of that.

Member Author:

> Why aren't we creating a copy here?

The motivation is a use case like this:

```python
monitoring = Monitoring()  # a hypothetical monitoring callback
lr = LogisticRegression()._set_callbacks(monitoring)
GridSearchCV(lr, param_grid).fit(X, y)
monitoring.plot()
```

The monitoring callback will gather information across all copies of logistic regression made in the grid search. If we made a copy of the callback in clone, then we couldn't retrieve any information once the grid search is finished.

Member:

The object itself can disable copying by implementing __copy__ and __deepcopy__, and then they would be in the space of "we know what we're doing" and we don't need to worry about it.

I think the kind of state you're referring to here is something which can live outside the callback object, like a file / a database / an external singleton object; the callback method just writes into that storage, and at the end one can use that data to plot/investigate/etc.
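A minimal sketch of that opt-out escape hatch (Monitoring is a hypothetical class, not the PR's code): the callback returns itself from the copy hooks, so every clone shares the single instance and state gathered during fit survives cloning.

```python
import copy


class Monitoring:
    def __copy__(self):
        # Opt out of copying: all "copies" are the same instance.
        return self

    def __deepcopy__(self, memo):
        return self


monitoring = Monitoring()
assert copy.deepcopy(monitoring) is monitoring
```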

Member Author:

> I think the kind of state you're referring to here is something which can live outside the callback object, like a file / a database / an external singleton object

Good point, and actually in my latest design of the callbacks I no longer rely on a shared state, so you can ignore my previous comment :) We can't get around copies anyway, since the clones can happen in subprocesses (in a grid search for instance). I updated the PR to clone the callbacks like any other param.

sklearn/base.py (outdated):

```python
    """
    if hasattr(sub_estimator, "_callbacks") and any(
        callback.auto_propagate for callback in sub_estimator._callbacks
    ):
```
Member:

What happens if you call this method twice? Wouldn't this be raised on the second run? If so, it feels a bit fragile. What's the harm of not checking this?

Member Author:

It should never be called twice. The meta estimator only propagates its callbacks once to its sub-estimator(s).

This error is important in order to give an informative message if a user tries something like:

```python
lr = LogisticRegression()._set_callbacks(ProgressBar())
GridSearchCV(lr, param_grid)
```

It would crash without telling the user the right way to do it.
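For contrast, the intended pattern is presumably to set the callback on the meta-estimator so it auto-propagates to the cloned sub-estimators, as in this sketch (assuming the PR-branch ProgressBar and an estimator adapted to the callback API):

```python
from sklearn.callback import ProgressBar  # only exists on the PR branch
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(random_state=0)
param_grid = {"C": [0.1, 1.0]}

search = GridSearchCV(LogisticRegression(), param_grid)
search._set_callbacks(ProgressBar())  # propagates to sub-estimators during fit
search.fit(X, y)
```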

Comment on lines 81 to 84:

```
- descr: str
    A description of the level
- max_iter: int or None
    The number of its children. None means it's a leaf.
```
@rth (Member) commented Dec 18, 2023:

Wait, so max_iter has nothing to do with the number of iterations of an algorithm. Why is it called that way, then?

Also, if it's expected to have only a few keys, typing-wise it might have been better to make it a dataclass or something rather than a dict, but no strong opinion.
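Something along these lines (a sketch of the dataclass suggestion; TaskLevel is a hypothetical name, not in the PR):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskLevel:
    descr: str               # description of the level
    max_iter: Optional[int]  # number of children; None means it's a leaf


levels = [
    TaskLevel(descr="fit", max_iter=10),
    TaskLevel(descr="iter", max_iter=None),
]
```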

@adrinjalali (Member) left a comment:

A very shallow review.

Comment on lines +1341 to +1344:

```python
    try:
        return fit_method(estimator, *args, **kwargs)
    finally:
        estimator._eval_callbacks_on_fit_end()
```
Member:

This is becoming larger than just a validation wrapper. We could reduce the magic and simplify debugging by having a BaseEstimator.fit which calls self._fit(...) and does all the common stuff before and after. That seems a lot easier to understand and debug.
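A sketch of that template-method idea (standalone toy classes, not scikit-learn's actual BaseEstimator):

```python
class Base:
    def fit(self, X, y=None):
        # common pre-fit work (parameter validation, callback setup, ...)
        try:
            return self._fit(X, y)  # estimator-specific logic
        finally:
            self._eval_callbacks_on_fit_end()  # common post-fit work

    def _eval_callbacks_on_fit_end(self):
        pass  # placeholder for notifying callbacks that fit ended


class MyEstimator(Base):
    def _fit(self, X, y=None):
        return self  # actual training would happen here


MyEstimator().fit(X=None)
```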

Member Author:

The motivation when I introduced _fit_context was not only for validation but to have a generic context manager to handle everything we need to do before and after fit. That's why I gave it this generic name.

Although having a consistent framework where we'd have a BaseEstimator.fit and every child estimator implements _fit is appealing, I think it goes far beyond the scope of this PR and requires rewriting a lot of estimators.

Member Author:

Btw, do you know why BaseEstimator does not implement fit in the first place?

Also, note that _fit_context also handles partial_fit, but I don't think we want BaseEstimator to implement partial_fit.

Member:

BaseEstimator doesn't implement fit because we don't generally have methods which raise NotImplementedError; they're simply not there. But now that we have all this work, we could certainly have it in BaseEstimator, and children would only implement a __sklearn_fit__ kind of method instead.

Member Author:

I still think it's outside the scope of this PR. Using the existing context manager is just a 1-line addition, whereas implementing __sklearn_fit__ means countless PRs :)

@rth (Member) commented Mar 5, 2024:

+1 to merge this. I personally find that it would be more valuable to mark it as experimental (it's private anyway so far), let users use it for a version or two, aggregate their feedback, and iterate on the details in the future if needed.

Overall the API sounds reasonable to me. Better that than approving a SLEP on this, only to then realize that users would have preferred something else, or that there are some weird edge cases for some estimators that need special handling.

@jeremiedbb (Member Author):

Note that this PR is targeting the callbacks branch, not main, so merging it is not a big commitment anyway 😄
And it would allow me to implement the rest that I did not include in this PR to keep it as small as possible.

> I personally find that it would be more valuable to mark it as experimental

I agree, and we already discussed that with @glemaitre. I plan to do that in a follow-up PR.

@jeremiedbb (Member Author) commented Mar 5, 2024:

I still need to figure out the issue with the CI, though.
EDIT: good now.
