
Multiple output regression #2087

Closed
miguelmartin75 opened this issue Mar 8, 2017 · 55 comments

@miguelmartin75

miguelmartin75 commented Mar 8, 2017

How do I perform multiple output regression? Or is it simply not possible?

My current assumption is that I would have to modify the code-base such that XGMatrix supports a matrix as labels and that I would have to create a custom objective function.

My end goal is to perform regression that outputs two variables (a point) and to optimise a Euclidean loss. Would I be better off making two separate models (one for x coordinates and one for y coordinates)?

Or... would I be better off using a random forest regressor within sklearn or some other alternative algorithm?

@khotilov
Member

Multivariate/multilabel regression is not currently implemented (#574, #680).
Tianqi added some relevant placeholder data structures to the gbtree learner, but I guess no one has had the time to work out the machinery.

@jindongwang

A pity, since many competitions involve multiple outputs.

@MarkusBonsch

This would be a really nice feature to have.

@joel-thomas-wilson

Do we have any updates on this?

@hcho3
Collaborator

hcho3 commented Sep 7, 2018

I'm adding this feature to the feature request tracker: #3439. Hopefully, we can get to it at some point.

@JacobKempster

I agree - this feature would be extremely valuable (exactly what I need right now...)

@lenselinkbart

I also agree: while this is quite trivial to do in neural nets, it would be nice to be able to do it in xgboost as well.

@cp9612

cp9612 commented Mar 26, 2019

Would like to see this feature coming

@veonua

veonua commented Apr 15, 2019

Any reason why this was closed?

@hcho3
Collaborator

hcho3 commented Apr 15, 2019

@veonua See #3439.

@loretoparisi

loretoparisi commented Sep 24, 2019

In the meantime, is there any alternative, such as an ensemble of single-output models like this:

from sklearn import multioutput
from xgboost import XGBRegressor

# Fit a model and predict the lens values from the original features
model = XGBRegressor(n_estimators=2000, max_depth=20, learning_rate=0.01)
model = multioutput.MultiOutputRegressor(model)
model.fit(X_train, X_lens_train)
preds = model.predict(X_test)

from: https://gist.github.com/MLWave/4a3f8b0fee43d45646cf118bda4d202a

@jimmywan

In the meantime, is there any alternative, such as an ensemble of single-output models?

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html

@cmottet

cmottet commented Jan 22, 2020

I am going to also weigh in and say that having such a feature would be extremely handy. The MultiOutputRegressor mentioned above is a nice wrapper for building multiple models at once, and it does work well for predicting target variables that are independent of one another. However, if the target variables are highly correlated, then you really want to build one model that predicts a vector.

@MxNl

MxNl commented Jan 7, 2021

Almost a year has passed since the last comment :-). That is why I want to repeat the wish for such an interesting feature. I would be happy to see it. Thanks anyway for all your work.

@hcho3 hcho3 reopened this Jan 21, 2021
@hcho3
Collaborator

hcho3 commented Jan 21, 2021

Reopening for visibility.

@kk26269

kk26269 commented Feb 4, 2021

Multivariate/multilabel regression is not currently implemented (#574, #680).
Tianqi added some relevant placeholder data structures to the gbtree learner, but I guess no one has had the time to work out the machinery.

Hello, I have used the scikit-learn estimator, passed my script (.py) written for multi-output regression to it, and was able to create endpoints.
I referred to the following repo:
https://github.com/qlanners/ml_deploy/tree/master/aws/scikit-learn/sklearn_estimators_locally.
The changes made are:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

# The last three columns are the targets; the rest are features
Y = dataset.iloc[:, -3:]
X = dataset.iloc[:, :-3]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=100)

gbr = GradientBoostingRegressor()
modelMOR = MultiOutputRegressor(estimator=gbr)
modelMOR.fit(X_train, Y_train)

@mirik123

mirik123 commented Jul 22, 2021

The MultiOutputRegressor is a poor alternative because it doesn't update the eval_set dataset together with the main training (X, y) dataset.

@trivialfis
Member

I would love to spend some time on this ...

@loretoparisi

I would love to spend some time on this ...

I have used this approach and it seems to work fine

#2087 (comment)

@StatMixedML

Is there any update on this? Can we make it a joint effort to get multi-output regression available? Irrespective of modelling the dependence between the several responses/y-variables, it would be great to have xgb.DMatrix accept a list or an np.array with more than one target column.

@trivialfis trivialfis self-assigned this Sep 14, 2021
@jameslamb
Contributor

To be honest, I also am not sure whether it's exactly the same. But it should be similar, right? Whether you are working on multiple tasks like "regression and classification" or multiple targets like "regression predicting y_1 and y_2", you still are in a situation like "find splits that balance gain across multiple loss functions".

To be honest, I haven't read the paper and am not planning to actively work on this (we have many other higher priorities in LightGBM right now).

@StatMixedML

@jameslamb

To be honest, I haven't read the paper and am not planning to actively work on this (we have many other higher priorities in LightGBM right now).

Sure, I understand that. I am not sure I'll find the time either. So maybe let's pause this and see if the community picks it up.

@trivialfis
Member

Did a quick scan over a couple of papers. I don't have a good understanding of various algorithms yet, but vector leaf seems to be the essential component of all proposed methods. I will try to prioritize it and share a roadmap for a path forward.

@StatMixedML

@trivialfis Ok nice! Looking forward

@StatMixedML

@trivialfis This might be an interesting approach to incorporate into XGBoost

SketchBoost: Fast Gradient Boosted Decision Tree for Multioutput Problems

The paper says

Moreover, the proposed methods are easy to implement upon modern boosting frameworks such as XGBoost

You can find the code here: https://github.com/sb-ai-lab/Py-Boost

@trivialfis
Member

@StatMixedML Thank you for the references. Here's an early version of vector leaves: #8616. No specific optimization yet.

@StatMixedML

@trivialfis Very nice, I'll have a look into it.

@trivialfis
Member

I will clean up the code in the coming days. There are some known issues that break existing code; at the moment, the only thing that works is the demo. It's for discussion and far from ready.

@StatMixedML

@trivialfis Sure, take your time. Let me know once I can use it.

Looking forward to it!

@StatMixedML

@trivialfis I have seen that you created a PR for a first version of the multi-target tree. This is awesome!!

Let me know once I can test it. Would be great to run some examples and compare accuracy and runtime. Willing to volunteer on this!

@lcrmorin

lcrmorin commented Feb 9, 2023

I am currently trying this. Should I expect any performance/memory gain over tuning multiple models?

@trivialfis
Member

Hi @StatMixedML @lcrmorin, thank you for volunteering! The PR is not ready yet; I still need to figure out some parts of the parameter interface and do more tests. If you really want to try the code, demo/guide-python/multioutput_regression.py would be a good starting place; see the rmse_model function and the parameters used in there.

@StatMixedML

StatMixedML commented Feb 10, 2023

@lcrmorin So the advantage of using multi-output models is that you don't have to train a separate model for each response variable. Also, as outlined in Multi-Target XGBoostLSS Regression, you can model dependencies between the different responses. What @trivialfis is currently working on is to speed-up the estimation using multi-target trees to better and efficiently scale to multiple response variables.


@lcrmorin

So that would also help inference time, right?

@StatMixedML

@lcrmorin We would expect to see the highest efficiency gains during training time, especially for HPO / cross-validation.

@trivialfis trivialfis moved this from Need prioritize to 2.0 In Progress in 2.0 Roadmap Mar 17, 2023
@trivialfis
Member

The bare-bones implementation is merged; please help test it out. :-)
You can find a reference to the nightly build in xgboost's Python installation document. The computational performance is not yet optimized, so please expect some quirks.

@trivialfis
Member

The link will be available once the CI passes

@trivialfis trivialfis moved this from 2.0 In Progress to 2.0 Done in 2.0 Roadmap Mar 22, 2023
@trivialfis
Member

A bug fix PR for prediction along with a small optimization: #8968 .

@hcho3 hcho3 pinned this issue Mar 29, 2023
@trivialfis trivialfis unpinned this issue Mar 30, 2023
@lcrmorin

Just being curious: would this allow some missing/masked targets? (I have multi-target time-series applications in mind, where longer-horizon targets are not available immediately.)

@trivialfis
Member

Not planned at the moment; the label is required to be dense. But I will mark that as a feature request and see if we can find a way to train boosting-tree models with missing labels.

@trivialfis
Member

Hi all, thank you for joining the discussion and for the helpful feedback! Let's continue the discussion in #9043.
