
Is multiple Regression possible with Hyperparameter_hunter #96

Open
Codeguyross opened this issue Nov 16, 2018 · 8 comments
Labels
Question Further information is requested

Comments

@Codeguyross

I have everything running, but the problem I am looking to solve is a multiple regression problem. Is this possible with hyperparameter_hunter?

@HunterMcGushion
Owner

I've never tested multiple regression, actually. But I'm very interested in making sure it's supported!

Can you provide a minimal toy example that resembles the shape of the data you'd be working with, and uses the appropriate metrics?

@HunterMcGushion HunterMcGushion added the Question Further information is requested label Nov 16, 2018
@strelzoff-erdc

We made you an example of Multiple Regression - both a simple linear example and a much harder non-linear example.

Toy Multiple Regression Notebook

@HunterMcGushion
Owner

Wow! Thank you very much for setting up such a clear example! I have a few questions that are going to sound stupid, but I need to ask just to make sure I understand what you need since I haven’t worked with multiple regression before:

  1. The prediction files for multiple regression problems should contain the predictions for ALL target columns, correct? In your first example, that would be [“y1”, “y2”, “y3”, “y4”]
  2. Regarding your scaling with StandardScaler, do you envision doing this to all of the data before providing it to HyperparameterHunter? Or should this be a preprocessing step performed by HyperparameterHunter, during cross-validation, for example?
  3. Am I correct in saying that you need the following to be supported in HyperparameterHunter:
    1. Multiple target_columns for regression (this is already supported, but only tested for multi-classification)
    2. Prediction files that contain predictions for all of the target_columns (also only tested for multi-classification)
    3. Some way to work with sklearn.multioutput.MultiOutputRegressor
      • Is MultiOutputRegressor strictly being used because SVR and XGBRegressor don’t offer native support for multiple targets? In other words, is support for it necessary? If it were not supported, how much functionality would be lost?
    4. Metrics calculation based on multiple target columns

If I am missing any requirements in my third question, please let me know, and if anything I’ve said seems even slightly questionable, I’d appreciate the correction.
Just want to make sure we’re on the same page! Thanks again for your fantastic example!

Unrelated: In your example's first train_test_split call, it should be using X and Y as input, rather than X_scaled and Y_scaled, correct?

@HunterMcGushion
Owner

It looks like HyperparameterHunter already works with the regressors in your example that aren't wrapped in MultiOutputRegressor. Would you mind trying out the provided example to verify that it gives you the results you're expecting?

from hyperparameter_hunter import Environment, CrossValidationExperiment

import pandas as pd

from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, Ridge, MultiTaskLasso, MultiTaskElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

# Trivial Linear Multiple Regression Problem with a little Noise (0.1)
x, y = make_regression(
    n_samples=1000,
    n_features=4,
    n_informative=4,
    n_targets=4,
    noise=0.1,
    random_state=42,
)

#################### Train/Holdout Split ####################
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.10, random_state=42)

#################### Scale Data ####################
x_scaler = StandardScaler()
x_train_scaled = x_scaler.fit_transform(x_train)
x_test_scaled = x_scaler.transform(x_test)

y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train)
y_test_scaled = y_scaler.transform(y_test)

#################### Reorganize Into Scaled DFs ####################
x_train_df = pd.DataFrame(x_train_scaled, columns=['x1', 'x2', 'x3', 'x4'])
y_train_df = pd.DataFrame(y_train_scaled, columns=['y1', 'y2', 'y3', 'y4'])
train_df = pd.concat([x_train_df, y_train_df], axis=1)

x_holdout_df = pd.DataFrame(x_test_scaled, columns=['x1', 'x2', 'x3', 'x4'])
y_holdout_df = pd.DataFrame(y_test_scaled, columns=['y1', 'y2', 'y3', 'y4'])
holdout_df = pd.concat([x_holdout_df, y_holdout_df], axis=1)

regressors = [
    LinearRegression,
    KNeighborsRegressor,
    DecisionTreeRegressor,
    MultiTaskLasso,
    MultiTaskElasticNet,
    Ridge,
    MLPRegressor,
]

regressor_params = [
    dict(),
    dict(),
    dict(),
    dict(alpha=0.01),
    dict(alpha=0.01),
    dict(alpha=0.05),
    dict(
        hidden_layer_sizes=(5,),
        activation='relu',
        solver='adam',
        learning_rate='adaptive',
        max_iter=1000,
        learning_rate_init=0.01,
        alpha=0.01,
    ),
]

#################### HyperparameterHunter ####################
env = Environment(
    train_dataset=train_df,
    holdout_dataset=holdout_df,
    root_results_path="multiple_regression_assets",
    metrics_map=["mean_squared_error"],
    target_column=['y1', 'y2', 'y3', 'y4'],
    cross_validation_type="KFold",
    cross_validation_params=dict(n_splits=10, shuffle=True, random_state=32),
)

for initializer, init_params in zip(regressors, regressor_params):
    exp = CrossValidationExperiment(
        model_initializer=initializer,
        model_init_params=init_params,
    )

@strelzoff-erdc

strelzoff-erdc commented Nov 20, 2018

No problem - it's actually useful for us to set up some toy problems to test ideas.

  1. The prediction files for multiple regression problems should contain the predictions for ALL target columns, correct? In your first example, that would be [“y1”, “y2”, “y3”, “y4”]

Yes, we're trying to predict all target columns

  2. Regarding your scaling with StandardScaler, do you envision doing this to all of the data before providing it to HyperparameterHunter? Or should this be a preprocessing step performed by HyperparameterHunter, during cross-validation, for example?

For regression, StandardScaler (mean set to zero, all elements scaled to have std = 1.0) or an equivalent is always necessary. Think of it this way: for regression, accuracy is always some kind of distance between the predicted points and their true locations (in target space). If our data and model were about astronomy, then our MSE might be 1.5 light-years. Is this a good model? How does it compare to some required level of accuracy? To get at that question, we need to scale.

Also, most fancier models (keras, tensorflow, ...) absolutely require scaling in order to work correctly. This is sometimes hidden or implicit in classification problems - for example, MNIST images are converted to gray-scale, where all pixels end up between 0.0 and 1.0 and the mean is around 0.5 - not exactly standardized, but close enough.

I think scaling is part of data prep prior to HPH, but it would be good if the provided examples showed scaling.
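To make that scale-before-HPH workflow concrete, here is a minimal sketch with synthetic data (the shapes and distribution parameters are illustrative): the scaler is fit on the training split only and then reused on the holdout split, so no holdout statistics leak into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x_train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # training features
x_test = rng.normal(loc=5.0, scale=3.0, size=(50, 4))    # holdout features

# Fit on the training split only; reuse the fitted scaler on holdout
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# After scaling, each training column has mean ~0 and std ~1
print(x_train_scaled.mean(axis=0).round(6))
print(x_train_scaled.std(axis=0).round(6))
```

The same pattern applies to the target columns when y-scaling is wanted, as in the toy notebook.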

  3. Am I correct in saying that you need the following to be supported in HyperparameterHunter:

    1. Multiple target_columns for regression (this is already supported, but only tested for multi-classification)

    2. Prediction files that contain predictions for all of the target_columns (also only tested for multi-classification)

    3. Some way to work with sklearn.multioutput.MultiOutputRegressor

      • Is MultiOutputRegressor strictly being used because SVR and XGBRegressor don’t offer native support for multiple targets? In other words, is support for it necessary? If it were not supported, how much functionality would be lost?
    4. Metrics calculation based on multiple target columns

1, 2 & 4 would be great.
For question 3: SVR, XGBRegressor, and many other scikit-learn models do not support multiple regression targets. MultiOutputRegressor is a handy way of packing a single problem set into one run. All of our work (and most real-world predictive analytics) involves predicting multiple properties from a single set of gathered data (in our case, thousands of things).
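For readers following along, a minimal sketch of the wrapping described above, using SVR (which is single-target on its own) on synthetic data; the sample sizes and SVR parameters are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Four targets, matching the toy notebook above
x, y = make_regression(n_samples=200, n_features=4, n_informative=4,
                       n_targets=4, noise=0.1, random_state=42)

# SVR predicts a single target, so MultiOutputRegressor fits one
# independent SVR per target column
model = MultiOutputRegressor(SVR(kernel="rbf", C=1.0))
model.fit(x, y)

preds = model.predict(x)
print(preds.shape)             # one prediction column per target
print(len(model.estimators_))  # one fitted SVR per target
```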

If I am missing any requirements in my third question, please let me know, and if anything I’ve said seems even slightly questionable, I’d appreciate the correction.
Just want to make sure we’re on the same page! Thanks again for your fantastic example!

Unrelated: In your example's first train_test_split call, it should be using X and Y as input, rather than X_scaled and Y_scaled, correct?

oops, yes, the perils of Jupyter notebooks and not working strictly from top to bottom.

@HunterMcGushion
Owner

HunterMcGushion commented Nov 20, 2018

I think scaling is part of data prep prior to HPH, but it would be good if the provided examples showed scaling.

Understood. I've had an unfinished feature_engineering module on the back-burner for quite some time that I intend to implement so people can keep track of their feature engineering pipelines (including scaling). I'll start looking into getting that working.

1, 2 & 4 would be great.
For question 3: SVR, XGBRegressor, and many other scikit-learn models do not support multiple regression targets. MultiOutputRegressor is a handy way of packing a single problem set into one run. All of our work (and most real-world predictive analytics) involves predicting multiple properties from a single set of gathered data (in our case, thousands of things).

I see; thank you for explaining that. I was unfamiliar with SKLearn's multioutput module; however, after taking a look at its contents, I do have a question about how they are generally used.

I noticed that MultiOutputClassifier and MultiOutputRegressor require only an estimator argument (and optionally n_jobs). Adding support for these two seems more straightforward than ClassifierChain and RegressorChain, which add the kwargs order, cv, and random_state.
So my questions are, having never used these classes before:

  1. Are these kwargs necessary to achieve the full functionality of the classes?
  2. How do the latter two classes use the cv kwarg? Would it even be necessary since HyperparameterHunter is already fitting the model in a cross validation loop defined by the user? After my (admittedly) very limited exploration, it seems to me that the cv kwarg could cause problems/misunderstanding. However, I could be mistaken in assuming it uses cv to perform some standard cross-validation fitting loop, so I'd very much appreciate some clarification on it.
  3. In HyperparameterHunter, parameters are, for the most part, separated into one of two groups: 1) Experiment hyperparameters being tuned, and 2) Environment/cross-experiment parameters (like cross-validation parameters and random seeds) which determine when experiments can be fairly compared and learned from during optimization. Given that structure, how would you classify the extra kwargs in the multioutput classes (order, cv, random_state, n_jobs)? It seems to me that rather than being model initialization parameters, they would fit better with Environment parameters.

Thank you for your patience in explaining to me what you need and for helping make HyperparameterHunter even better! I appreciate you taking the time!

Edit: Another question, would it be an expectation that using ClassifierChain or RegressorChain would produce the predictions made by estimators earlier in the chain (in a separate output file, for example), or are these intermediate predictions disposable once final predictions have been made?
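For reference while discussing these kwargs, a minimal sketch of RegressorChain on synthetic data (the order, cv, and random_state values are purely illustrative, not recommendations): each model in the chain receives the earlier targets as extra input features, and an integer cv makes the chain use cross-validated predictions of those earlier targets during fitting.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.multioutput import RegressorChain

x, y = make_regression(n_samples=200, n_features=4, n_targets=3,
                       noise=0.1, random_state=42)

# order fixes which target is modeled first; cv=3 means earlier
# targets are fed to later models as out-of-fold predictions rather
# than their true values; random_state only matters for order="random"
chain = RegressorChain(Ridge(), order=[2, 0, 1], cv=3, random_state=0)
chain.fit(x, y)

print(chain.predict(x).shape)  # one prediction column per target
```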

@strelzoff-erdc

Hi,

I tried RegressorChain on the toy problem, and it doesn't seem particularly useful. Scikit-learn doesn't have support for sequences (more or less deliberately). The chain estimators seem like a blip of effort toward sequences, where it would be useful to "roll up" estimators from left to right.

Suppose you had two sequences and you wanted to predict the second from the first. The more likely and general approach would be to run MultiOutputRegressor with each regressor targeting a y[i] and receiving a staggered rolling window of x[i-10:i+10] values as input. This is the recommended approach for dealing with sequence prediction problems in XGBoost.
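A hypothetical sketch of that rolling-window framing (the helper name, window width, and toy sequences are all illustrative): each row of the design matrix is a window of the first sequence, and the target is the aligned value of the second sequence.

```python
import numpy as np

def window_features(series, width):
    """Row i holds the `width` values of `series` ending just before
    index i + width; indices without a full window are dropped."""
    return np.array([series[i:i + width] for i in range(len(series) - width)])

# Toy pair of sequences: predict seq_b[i] from the window of seq_a
# values immediately preceding index i
seq_a = np.arange(20, dtype=float)
seq_b = 2 * seq_a + 1

X = window_features(seq_a, width=3)  # one window per usable index
y = seq_b[3:]                        # targets aligned with the windows
print(X.shape, y.shape)
```

Any single-target regressor (or a MultiOutputRegressor over several such targets) can then be fit on X and y.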

I tried the code you provided for MultiOutputRegressor, which unfortunately fails during prediction with the following error (in the cascade).

The failing line is in predictors.py:
predictions = pd.DataFrame(data=predictions, index=index, columns=target_column, dtype=dtype)

with error ValueError: Shape of passed values is (100, 100), indices imply (4, 100)

Presumably, (4, 100) is the expected shape for the fold of 4 y's.
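A minimal sketch reproducing that kind of mismatch (the array contents are dummies, and the exact error wording varies by pandas version): framing 100x100 prediction data into 4 named target columns raises the same ValueError.

```python
import numpy as np
import pandas as pd

predictions = np.zeros((100, 100))        # shape reported by the failing run
target_column = ["y1", "y2", "y3", "y4"]  # the 4 expected target columns

raised = False
try:
    # Mirrors the predictors.py call: data shape disagrees with the
    # number of columns the index/columns arguments imply
    pd.DataFrame(data=predictions, index=range(100), columns=target_column)
except ValueError as err:
    raised = True
    print(err)
print(raised)
```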

Notebook with minimal changes to your code here: [HPH multi-regression code notebook](https://github.com/strelzoff-erdc/HPH-experiments/blob/master/HPH_multi-regression%20experiment%20test.ipynb)

@HunterMcGushion
Owner

HunterMcGushion commented Nov 21, 2018

Are you using the current master version of HyperparameterHunter? It looks like you might be using the latest PyPI release (2.0.0), in which predictions for those multi-regression algorithms had not yet been added.

I apologize; I should have clarified that I was running the examples with the unreleased master version. Can you try installing HyperparameterHunter from GitHub and running the example again?

pip install git+https://github.com/HunterMcGushion/hyperparameter_hunter.git

Edit: I also just realized that at the time I posted the example, I hadn't even pushed the changes that were making it work. So, once again, I apologize for causing this confusion.
