Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC Add example showcasing HGBT regression #26991

Merged
merged 67 commits into from Feb 22, 2024

Conversation

ArturoAmorQ
Copy link
Member

@ArturoAmorQ ArturoAmorQ commented Aug 2, 2023

Reference Issues/PRs

Fixes #26826. See also #21967 and #23746 on missing values documentation.

What does this implement/fix? Explain your changes.

This PR adds an example to:

  • replace the landing-page figure by a simple didactic plot
  • showcase HGBT nice features such as:
    • Quantile regression
    • Support of missing values
    • Monotonicity constraints
  • be cross-linked in the documentation
  • be cross-linked in other examples

Any other comments?

The original issue suggests also demoing support of categorical values, but we already have Categorical Feature Support in Gradient Boosting, which is only linked in the present example as it is a very good example itself.

Indeed, we also have a Monotonic constraints example but it can be merged with the example from this PR.

@github-actions
Copy link

github-actions bot commented Aug 2, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ea70999. Link to the linter CI: here

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first iteration.

examples/ensemble/plot_hgbt_regression.py Outdated Show resolved Hide resolved
examples/ensemble/plot_hgbt_regression.py Outdated Show resolved Hide resolved
examples/ensemble/plot_hgbt_regression.py Outdated Show resolved Hide resolved
examples/ensemble/plot_hgbt_regression.py Outdated Show resolved Hide resolved
# %%
# Notice energy transfer increases systematically during weekends.
#
# Effect of number of trees in HistGradientBoostingRegressor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, we should change this section to highlight early stopping and finding a good pair of (max_iter, learning_rate).
I consider this important as it is shown nowhere and usually the first step to train a HGBT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need help here, just say so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, we should change this section to highlight early stopping and finding a good pair of (max_iter, learning_rate)

I guess you mean by doing a grid search over those 2 parameters and exploring the respective n_iter_ attribute, right? Or what do you have in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grid search is overkill, even not so helpful. Just choose one learning rate, fit the HBGT on the training set and determine the best max_iter via early stopping and some explicit validation_fraction. From there on, use this max_iter and don't use early stopping anymore.
Once, this PR and #27124 are merged, we can add cross validation for this step.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into the issue reported in #25460, as early_stopping uses shuffle=True for the internal validation, which is not the right thing to do when dealing with time series, as done in this example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. But for the time being, it is better to "learn" max iter with the current implementation and early stopping than to rely on defaults.
After all, max iter and the learning rate are the most important parameters.

I would just add a comment that it is not optimal for time series ATM.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9a486b8.

examples/ensemble/plot_hgbt_regression.py Show resolved Hide resolved
examples/ensemble/plot_hgbt_regression.py Outdated Show resolved Hide resolved
examples/ensemble/plot_hgbt_regression.py Show resolved Hide resolved
examples/ensemble/plot_hgbt_regression.py Outdated Show resolved Hide resolved
examples/ensemble/plot_hgbt_regression.py Outdated Show resolved Hide resolved
examples/ensemble/plot_hgbt_regression.py Outdated Show resolved Hide resolved
examples/ensemble/plot_hgbt_regression.py Outdated Show resolved Hide resolved
examples/ensemble/plot_hgbt_regression.py Outdated Show resolved Hide resolved
@ArturoAmorQ
Copy link
Member Author

I think I have addressed all the comments. Please let me know otherwise.

@glemaitre glemaitre self-requested a review January 22, 2024 09:59
ArturoAmorQ and others added 4 commits January 23, 2024 10:50
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks already good to me. I'm just wondering that we don't show the automatic way of dealing with categorical data. The variable day is actually a "category" variable. I'm wondering if we could not simplify modify the code and mention that we have this new way to detecting and encoding the categorical variable internally.

_ = ax.legend()

# %%
# With just a few iterations, HGBT models can achieve convergence (see
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a sentence to describe what we see on the figure. 5 iterations is not enough to be able to predict. But at 50, we are already able to do a good job.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable day is actually a "category" variable.

The thing is that day is already ordinal encoded, so I don't think we can demo the categorical variables support here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not really matter. Since this is a category columns, the tree will treat it as such.

A potential thing that we could do is to replace the integers by the day as string at the beginning. I don't know if eventually it removes some boilerplate when plotting because we already have the name instead of remapping the integer to the days.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This said, I would be happy to merge this PR as-is because it is a net improvement and see if we can further improve it with this increment.

examples/ensemble/plot_hgbt_regression.py Outdated Show resolved Hide resolved
@ArturoAmorQ
Copy link
Member Author

This said, I would be happy to merge this PR as-is because it is a net improvement and see if we can further improve it with this increment.

Should I push my own green button?

@lorentzenchr
Copy link
Member

lorentzenchr commented Feb 21, 2024

Should I push my own green button?

No. @glemaitre you did the latest review. Fine to merge?

@glemaitre
Copy link
Member

Yes. Fine To merge. @lorentzenchr do you have further changes (just because the status is not approved on your side).

@lorentzenchr
Copy link
Member

the status is not approved on your side

I approved a month ago and then you requested a review by me.
I‘ll merge.

@lorentzenchr lorentzenchr merged commit c826fec into scikit-learn:main Feb 22, 2024
28 checks passed
@glemaitre
Copy link
Member

glemaitre commented Feb 22, 2024 via email

@GaelVaroquaux
Copy link
Member

Very happy about this! Congratulation to every one involved.

Nitpick: I would remove "advanced" from the title, as I think that many people should look at this example, and not only advanced people.

If y'all agree, I'll send a PR that does nothing else than remove this word 😀 😀

@lorentzenchr
Copy link
Member

@GaelVaroquaux Please go on and ping me for quick review if you‘d like.

@GaelVaroquaux
Copy link
Member

Here it is #28508

I was not true to my words :$ I removed a tiny bit more than "advanced" :)

Thanks!!

@ArturoAmorQ ArturoAmorQ deleted the hgbt_new_example branch February 27, 2024 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add an example showcasing the HistGradientBoosting...
4 participants