[WIP] Callback API continued #22000

Draft · wants to merge 26 commits into main

Conversation

jeremiedbb (Member) commented on Dec 16, 2021

Fixes #78 #7574 #10973
Continuation of the work started in #16925 by @rth.

Goal

The goal of this PR is to propose a callback API that can handle the most important and most requested use cases.

Challenges

  • Supporting all these features and making each of these callbacks available is not easy, and will probably require some refactoring in many estimators.

    The proposed API makes it possible to enable the callbacks one estimator at a time: setting callbacks on estimators that don't support them yet has no effect. We can thus roll out support incrementally in subsequent dedicated PRs. Here I only did NMF, LogisticRegression and Pipeline, to show the necessary changes in the code base.

    The proposed API also makes it possible to enable only a subset of the features for an estimator and add the remaining ones later. For LogisticRegression, for instance, I only wired up the minimum.

  • Callbacks should not impact the performance of the estimators. Some quantities passed to the callbacks might be costly to compute, and we don't want to spend time computing them if the only callback is a progress bar, for instance.

    The solution I found is to evaluate them lazily using lambdas, and to only actually compute them if at least one callback requests them. For now callbacks request these quantities by defining specific class attributes, but maybe there's a better way (mixins?). A minimal sketch of this pattern is shown right after this list.

  • The callbacks described above are not all meant to be evaluated at the same fitting step of an estimator.

    When an estimator has several nested loops (LogisticRegressionCV(multi_class="ovr"), for instance, has a loop over the Cs, a loop over the classes, and then the final loop over the iterations on the dataset), the snapshot callback can only be evaluated at the end of an outer loop, while EarlyStopping would be evaluated at the end of an innermost loop, and the ProgressBar could be evaluated at each level of nesting.

    In this PR I propose that each estimator holds a computation tree as a private attribute representing these nested loops, the root being the beginning of fit and each node being one step of a loop. This structure is defined in _computation_tree.py. It gives a simple way to know exactly at which step of the fit we are at each evaluation of the callbacks, and is the best solution I found for the challenges described here. It also imposes the main changes to the code base, i.e. passing the parent node around. A simplified illustration of such a tree is shown after this list.

  • Dealing with parallelism, and especially multiprocessing, is the main challenge for me.

    Typically with a callback you might want to accumulate information during fit and recover it at the end. The issue is that the callback object is not shared between sub-processes: modifying its state in a sub-process (e.g. setting an attribute) will not be visible from the main process. The joblib API doesn't provide the inter-process communication that would be needed to overcome this.

    The solution we found is to have the callbacks persist the information they want to keep (in files in this first implementation, but we might consider sockets or another mechanism). It's relatively easy to avoid race conditions with this design.
    This is necessary, for example, to report progress in real time. In an estimator running in parallel there is no single "current" computation node; we are at different nodes at the same time. But having the status of each node written to a file, updated at each call to the callbacks, lets the main process know the current overall progress (there are other difficulties, described later).

  • The last main challenge is meta-estimators. We'd like some callbacks to be set on the meta-estimator, like progress bars, but others to be set on the underlying estimator(s), like early stopping. Moreover, we encounter the parallelism issue again if the meta-estimator fits clones of the underlying estimator in parallel, like GridSearchCV does.

    For that, I propose a mixin that marks a callback as one that should be propagated to sub-estimators. This way the meta-estimator only propagates the appropriate callbacks to its sub-estimators, and these sub-estimators can also have their own regular callbacks.
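
Below is a minimal, hypothetical sketch of the lazy-evaluation idea mentioned above. The names (ProgressLike, SnapshotLike, request_reconstruction_attributes, eval_callbacks_on_fit_iter_end) are illustrative only, not the actual identifiers used in this PR:

    # Hypothetical sketch (not the actual PR code): costly quantities are wrapped
    # in a lambda and only computed if at least one callback declares it needs them.

    class ProgressLike:
        request_reconstruction_attributes = False  # cheap: never needs the costly dict

        def on_fit_iter_end(self, *, reconstruction_attributes=None, **kwargs):
            return False  # False means "do not stop fitting"

    class SnapshotLike:
        request_reconstruction_attributes = True  # needs the costly dict

        def on_fit_iter_end(self, *, reconstruction_attributes=None, **kwargs):
            self.last_state = reconstruction_attributes
            return False

    def eval_callbacks_on_fit_iter_end(callbacks, *, reconstruction_attributes):
        # `reconstruction_attributes` is a zero-argument callable, evaluated lazily.
        if any(getattr(cb, "request_reconstruction_attributes", False) for cb in callbacks):
            reconstruction_attributes = reconstruction_attributes()  # pay the cost once
        else:
            reconstruction_attributes = None  # skip the computation entirely
        # Stop the loop if any callback asks for it.
        return any(
            [cb.on_fit_iter_end(reconstruction_attributes=reconstruction_attributes)
             for cb in callbacks]
        )

    # Inside an estimator's fit loop, the costly dict is only built on demand:
    callbacks = [ProgressLike(), SnapshotLike()]
    for i in range(10):
        if eval_callbacks_on_fit_iter_end(
            callbacks, reconstruction_attributes=lambda: {"n_iter_": i + 1}
        ):
            break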
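
Similarly, here is a simplified, hypothetical illustration of how nested fit loops could map to a computation tree. This is not the actual _computation_tree.py implementation, just a picture of the idea:

    # Hypothetical sketch: the root is the start of fit, each node is one step
    # of one loop level, and each node knows its parent.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        description: str
        parent: "Node | None" = None
        children: list = field(default_factory=list)

        def add_child(self, description):
            child = Node(description, parent=self)
            self.children.append(child)
            return child

        def depth(self):
            return 0 if self.parent is None else self.parent.depth() + 1

    # LogisticRegressionCV(multi_class="ovr")-like nesting: Cs -> classes -> iterations.
    root = Node("fit")
    for C in (0.1, 1.0):
        c_node = root.add_child(f"C={C}")
        for klass in (0, 1):
            class_node = c_node.add_child(f"class={klass}")
            for it in range(3):
                iter_node = class_node.add_child(f"iter={it}")
                # A callback evaluated here knows exactly where it is in the fit:
                # an innermost-loop node, at depth 3, with a full path to the root.
                assert iter_node.depth() == 3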

The API

This PR adds a new module, sklearn.callback, which exposes BaseCallback, the abstract base class for callbacks. All callbacks must inherit from BaseCallback. It also exposes AutoPropagatedMixin: callbacks that should be propagated to sub-estimators by meta-estimators must inherit from this as well.

BaseCallback has 3 abstract methods:

  • on_fit_begin. Called at the beginning of fit, after all validations. We pass a reference to the estimator, X_train and y_train.
  • on_fit_iter_end. Called at the end of each node of the computation tree, i.e. each step of each nested loop. We pass a reference to the estimator (which at this point might be different from the one passed to on_fit_begin for propagated callbacks) and the computation node where it was called. We also pass some of these:
    • stopping_criterion: when the estimator has a stopping criterion such that the iterations stop when stopping_criterion <= tol.
    • tol: tolerance for the stopping criterion.
    • reconstruction_attributes: the attributes needed to construct an estimator (by copying the estimator and setting these as attributes) that behaves as if fit had stopped at this node, so that we can then call predict, transform, etc. on it.
    • fit_state: model-specific quantities updated during fit. This is not meant to be used by generic callbacks, but rather by a callback designed for a specific estimator. This argument is not used in any of the use cases described above, but I think it's important to have for custom callbacks. It's the role of each estimator to decide what is interesting to pass to the callback. We could later think of a new field in the docstring of the estimators describing which keys they pass in this argument.
  • on_fit_end. Called at the end of fit. Takes no argument. It allows the callback to do some clean-up.
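
To make the interface concrete, here is a minimal sketch of a custom callback written against the three methods described above. The import of BaseCallback from sklearn.callback follows the PR description; the exact signatures are illustrative and may differ from the final API:

    # Minimal sketch of a custom callback; signatures are illustrative.
    from sklearn.callback import BaseCallback

    class IterationCounter(BaseCallback):
        """Count how many computation-tree nodes are visited during fit."""

        def on_fit_begin(self, estimator, X=None, y=None):
            self.n_nodes_ = 0

        def on_fit_iter_end(self, estimator, node, **kwargs):
            self.n_nodes_ += 1
            return False  # returning True would request early stopping

        def on_fit_end(self):
            print(f"visited {self.n_nodes_} nodes")

    # Usage, mirroring the examples below:
    # nmf._set_callbacks(IterationCounter())
    # nmf.fit(X)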

Examples

  • Progress bars.


    Here's an example of progress monitoring using rich. I used custom estimators to simulate a complex setting: a meta-estimator (like a GridSearchCV) running in parallel, with a sub-estimator that also runs in parallel.

    simplescreenrecorder-2021-12-16_19.17.35.mp4
  • Convergence Monitoring

    from sklearn.decomposition import NMF
    import numpy as np
    X = np.random.random_sample((1100, 100))
    X_val = X[-100:]
    nmf = NMF(n_components=20, solver="mu")
    callback = ConvergenceMonitor(X_val=X_val)
    nmf._set_callbacks(callback)
    nmf.fit(X[:1000])
    callback.plot()

    callback_ex1

  • Snapshot

    from sklearn.decomposition import NMF
    import numpy as np
    X = np.random.random_sample((1100, 100))
    nmf = NMF(n_components=20, solver="mu")
    callback = Snapshot()
    nmf._set_callbacks(callback)
    nmf.fit(X[:1000])
    # interrupt fit. Ctrl-C for instance
    # [...]
    KeyboardInterrupt:
    
    import pickle
    with open(callback.directory / "2021-12-16_19-33-15-083014.pkl", "rb") as f:
        new_nmf = pickle.load(f)
    W = new_nmf.transform(X[-100:])
  • EarlyStopping


    If the on_fit_iter_end method of a callback returns True, the iteration loop breaks.

    from sklearn.decomposition import NMF
    import numpy as np
    X = np.random.random_sample((1100, 100))
    X_val = X[-100:]
    nmf = NMF(n_components=20, solver="mu")
    callback = EarlyStopping(monitor="objective_function", X_val=X_val, max_no_improvement=10, tol=1e-4)
    nmf._set_callbacks(callback)
    nmf.fit(X[:1000])
  • Verbose

    from sklearn.decomposition import NMF
    import numpy as np
    X = np.random.random_sample((1100, 100))
    nmf = NMF(n_components=20, solver="mu", max_iter=20)
    nmf._set_callbacks(TextVerbose())
    nmf.fit(X)
    [NMF] iter 0 | time 0.02493s | stopping_criterion=8.730E-01 | tol=1.000E-04
    [NMF] iter 1 | time 0.02634s | stopping_criterion=8.737E-01 | tol=1.000E-04
    [NMF] iter 2 | time 0.02768s | stopping_criterion=8.743E-01 | tol=1.000E-04
    [NMF] iter 3 | time 0.02893s | stopping_criterion=8.749E-01 | tol=1.000E-04
    [NMF] iter 4 | time 0.03016s | stopping_criterion=8.755E-01 | tol=1.000E-04
    [NMF] iter 5 | time 0.03136s | stopping_criterion=8.760E-01 | tol=1.000E-04
    [NMF] iter 6 | time 0.03255s | stopping_criterion=8.766E-01 | tol=1.000E-04
    [NMF] iter 7 | time 0.03375s | stopping_criterion=8.772E-01 | tol=1.000E-04
    [NMF] iter 8 | time 0.03496s | stopping_criterion=8.777E-01 | tol=1.000E-04
    [NMF] iter 9 | time 0.03691s | stopping_criterion=8.782E-01 | tol=1.000E-04
    [NMF] iter 10 | time 0.03841s | stopping_criterion=5.307E-04 | tol=1.000E-04
    [NMF] iter 11 | time 0.03966s | stopping_criterion=1.049E-03 | tol=1.000E-04
    [NMF] iter 12 | time 0.04087s | stopping_criterion=1.552E-03 | tol=1.000E-04
    [NMF] iter 13 | time 0.04209s | stopping_criterion=2.036E-03 | tol=1.000E-04
    [NMF] iter 14 | time 0.04327s | stopping_criterion=2.498E-03 | tol=1.000E-04
    [NMF] iter 15 | time 0.04447s | stopping_criterion=2.936E-03 | tol=1.000E-04
    [NMF] iter 16 | time 0.04565s | stopping_criterion=3.349E-03 | tol=1.000E-04
    [NMF] iter 17 | time 0.04686s | stopping_criterion=3.734E-03 | tol=1.000E-04
    [NMF] iter 18 | time 0.04804s | stopping_criterion=4.093E-03 | tol=1.000E-04
    [NMF] iter 19 | time 0.04923s | stopping_criterion=4.425E-03 | tol=1.000E-04 
    

TODO

This PR is still WIP.

  • The documentation of the callback module is still missing: it should describe the API, explain how to use and write callbacks, and include an example.
  • I started adding tests for the computation tree, but we need more, and I still need to add tests for the callback API and for each of the implemented callbacks.
  • Finalize and document the implemented callbacks; a few issues still need to be fixed in them.
  • Think about how callbacks should be reinitialized when reused, e.g. when refitting an estimator.

rth mentioned this pull request on Dec 17, 2021
ogrisel (Member) commented on Dec 17, 2021

Thanks! Another use case I see is structured logging: instead of generating lines in a text file, generate an event log in a JSON file, records in a database (e.g. MongoDB or PostgreSQL, possibly via a JSON column type), a Kafka stream, or events sent to ML tracking platforms, for instance MLflow's tracking features or Weights & Biases' wandb.log.
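
To illustrate, a callback along these lines could emit one JSON record per fit iteration. This is only a hypothetical sketch built on the on_fit_* API described in the PR description, not something included in the PR:

    # Hypothetical structured-logging callback writing one JSON line per
    # computation-tree node visited during fit.
    import json
    import time

    from sklearn.callback import BaseCallback

    class JsonLinesLogger(BaseCallback):
        def __init__(self, path="fit_events.jsonl"):
            self.path = path

        def on_fit_begin(self, estimator, X=None, y=None):
            self._file = open(self.path, "w")

        def on_fit_iter_end(self, estimator, node, *, stopping_criterion=None, **kwargs):
            record = {
                "time": time.time(),
                "estimator": estimator.__class__.__name__,
                "stopping_criterion": stopping_criterion,
            }
            self._file.write(json.dumps(record) + "\n")
            return False  # never request early stopping

        def on_fit_end(self):
            self._file.close()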

thomasjpfan (Member) left a comment

Thanks for working on this!

@@ -515,6 +518,22 @@ def sag{{name_suffix}}(SequentialDataset{{name_suffix}} dataset,
fabs(weights[idx] -
previous_weights[idx]))
previous_weights[idx] = weights[idx]

with gil:
if _eval_callbacks_on_fit_iter_end(

How does the overhead of taking the GIL compare to early stopping directly using the stopping_criterion?

jeremiedbb (Author) replied:

It has an impact on performance for sure, but if we want to enable callbacks at this step of the fit there's no way around it.
What we can do, however, is check whether the estimator has callbacks before entering the nogil section, and execute this part only if it does. Let me try something like that. We might encounter the same issue as in #13389.

sklearn/callback/_computation_tree.py (resolved review thread)
else:
# node is a leaf, look for tasks of its sub computation tree before
# going to the next node
child_dir = this_dir / str(node.tree_status_idx)

I think we should abstract away the filesystem that backs the computation trees because:

  1. I do not think we want third party developers writing Callbacks to worry about the filesystem.
  2. It will be easier to switch to another inter-process communication method in the future.

jeremiedbb (Author) replied:

That's probably better, yes. I'll try to come up with a friendlier solution.

@@ -0,0 +1,268 @@
# License: BSD 3 clause

@adrinjalali Discussing with @jeremiedbb IRL while I was explaining the sample-props PR, he was under the impression that the MetaDataRequest class would be similar to the ComputationTree in some regards. Maybe you could have a look for some inspiration :)

else:
sub_estimator._callbacks.extend(propagated_callbacks)

def _eval_callbacks_on_fit_begin(self, *, levels, X=None, y=None):

Would it be too magical to have a single call to _eval_callbacks_begin that internally inspects the call stack to infer which method called it?

Of course it would make sense only if the same methods are expected to be called for fit/predict/etc.

jeremiedbb (Author) replied:

Of course it would make sense only if the same methods are expected to be called for fit/predict/etc.

Well, that's not obvious at all, and I haven't really thought about it. This first iteration is all about fit. I think it will be easier not to try to be too magical for now.

chritter (Contributor) commented on Dec 29, 2021

@jeremiedbb Would this PR cover early stopping for RandomizedSearchCV, where a time budget would be beneficial (e.g. stop after X seconds instead of an iteration limit)? Snapshots seem to apply to a single estimator. Maybe it is out of scope. Thanks!

jeremiedbb (Author) replied:

@chritter For now, EarlyStopping based on a time budget in SearchCV estimators doesn't seem possible due to joblib (it might become possible at some point if the ability to return a generator is merged, joblib/joblib#588).

adampl commented on Sep 22, 2022

@jeremiedbb What is the current status of this feature? Is it abandoned? :(

jeremiedbb (Author) replied:

Is it abandoned? :(

No, it's not :) I haven't been working on it for some time, but I started working on it again a few weeks ago. There's still a lot of work to do though.

ogrisel (Member) commented on Sep 22, 2022

Maybe you could keep this WIP branch up to date ;)

github-actions (bot) commented:

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


mypy

mypy detected issues. Please fix them locally and push the changes. Here you can see the detected issues. Note that the installed mypy version is mypy=1.3.0.


sklearn/externals/_arff.py:782: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
sklearn/callback/tests/_utils.py:37: error: Need type annotation for "_parameter_constraints" (hint: "_parameter_constraints: Dict[<type>, <type>] = ...")  [var-annotated]
sklearn/callback/tests/_utils.py:74: error: Need type annotation for "_parameter_constraints" (hint: "_parameter_constraints: Dict[<type>, <type>] = ...")  [var-annotated]
Found 2 errors in 1 file (checked 553 source files)

Generated for commit: b8ac1a5. Link to the linter CI: here

jondo commented on Nov 29, 2023

Remark: #27663 implements a smaller portion of this.

amueller (Member) commented on Feb 9, 2024

I think I'm -1 on using callbacks for early stopping since I don't see a way of making it work within pipelines.

Successfully merging this pull request may close these issues.

Use python logging to report on convergence progress at level info for long running tasks
8 participants