Let Multivariable Regression only fit 1 variable of a "family" #48

Open
JrtPec opened this issue May 29, 2018 · 1 comment

Comments

@JrtPec
Member

JrtPec commented May 29, 2018

I tried something extreme, and so were the results: I generated weather data for many solar orientations, tilts, wind directions, ..., about 1600 variables in total, and the regression ended up with this formula:

Value ~ HDD_13 + GlobalIrradianceO270T90 + HDD_3 + windComponentSquared180 + GlobalIrradianceO265T80 + precipIntensity + windComponent95 + GlobalIrradianceO265T75 + CDD_22 + GlobalIrradianceO275T20 + GlobalIrradianceO260T50 + GlobalIrradianceO40T60 + windComponentCubed145 + GlobalIrradianceO0T0 + GlobalIrradianceO35T90 + GlobalIrradianceO100T55 + GlobalIrradianceO0T85

And got a miraculous RSquared of 1!

I could obviously fix it by reducing the number of variables. But what might also work is this: define certain "families" of variables (for instance, the heating degree days), and make sure the Analysis uses only one variable from each family to build its model.
This could simply be a list of lists, like

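# Variable names are column names, so they are written as strings;
# "..." marks entries left out here.
var_structure = [
    ["HDD_10", "HDD_11", ..., "HDD_24"],
    ["CDD_10", ...],
    ["GlobalIrradianceO0T0", "GlobalIrradianceO10T10", ...],
    ...
]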
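
To make the idea concrete, here is a minimal sketch of a greedy forward selection that closes a whole family as soon as one of its members is picked. It is not how the Analysis currently works: `select_one_per_family`, `score` and `min_gain` are hypothetical names, and `score` stands in for whatever goodness-of-fit figure the regression would use (for example an adjusted R² of a fit on the given variables).

```python
from typing import Callable, List, Sequence


def select_one_per_family(
    families: Sequence[Sequence[str]],
    score: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Greedy forward selection keeping at most one variable per family.

    `score` is a hypothetical callable: it takes a list of variable names
    and returns a goodness-of-fit figure (higher is better).
    """
    selected: List[str] = []
    open_families = [list(fam) for fam in families]
    best = score(selected)
    while open_families:
        # Try every variable of every family that has no member selected yet.
        candidates = [
            (score(selected + [var]), idx, var)
            for idx, fam in enumerate(open_families)
            for var in fam
        ]
        if not candidates:
            break
        top_score, fam_idx, var = max(candidates)
        if top_score - best <= min_gain:
            break  # no remaining variable improves the model enough
        selected.append(var)
        best = top_score
        del open_families[fam_idx]  # the whole family is now closed
    return selected


# Toy usage with made-up gains instead of a real regression fit.
families = [["HDD_13", "HDD_3"], ["CDD_22"], ["windComponent95"]]
toy_gain = {"HDD_13": 0.60, "HDD_3": 0.55, "CDD_22": 0.10, "windComponent95": 0.0}


def toy_score(variables: List[str]) -> float:
    return sum(toy_gain[v] for v in variables)


chosen = select_one_per_family(families, toy_score)
print(chosen)  # ['HDD_13', 'CDD_22']
```

In the toy run `HDD_3` is never considered again once `HDD_13` is chosen, which is the behaviour I have in mind for the families.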

@saroele thoughts?

@saroele
Member

saroele commented May 29, 2018

Looks like you had a fun day :-)

This is exactly what @kdebrab mentioned yesterday: with lots of candidate explanatory variables, you will get a perfect model (R²=1).
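
A small illustration of that effect, as a sketch on synthetic data (not the weather data from this issue, and the number of observations used there is not stated here): the target below is pure noise, yet ordinary least squares reaches R² = 1 once there are as many fitted coefficients as observations. The helper `r_squared` and all sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 12
y = rng.normal(size=n_obs)  # target is pure noise: there is nothing to explain


def r_squared(n_vars: int) -> float:
    """In-sample R² of an OLS fit on n_vars random, meaningless regressors."""
    # Design matrix: an intercept column plus n_vars noise columns.
    X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, n_vars))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)


r2_few = r_squared(2)            # 3 coefficients for 12 observations
r2_full = r_squared(n_obs - 1)   # 12 coefficients for 12 observations
print(f"2 noise variables:  R² = {r2_few:.3f}")
print(f"{n_obs - 1} noise variables: R² = {r2_full:.3f}")
```

With a large pool of candidates to choose from, a selection procedure only needs to find enough variables that happen to line up with the data to get the same result.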

Can you post the fit.summary() of the result? I want to have a look at model statistics.

The list-of-lists approach to group related variables should work, but could still lead to an overfitted model. So preferably, I'd like to find a way to avoid overfitting in general, without imposing any limits on the combination of variables.
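
One general option, given here only as a sketch and not as something the project has decided on, is to judge a candidate model on data it was not fitted on, for instance with k-fold cross-validation. The function `cv_r_squared` and the synthetic data below are assumptions made for the example.

```python
import numpy as np


def cv_r_squared(X: np.ndarray, y: np.ndarray, k: int = 5) -> float:
    """R²-like score computed from k-fold out-of-sample OLS predictions."""
    X = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    folds = np.array_split(np.arange(len(y)), k)
    press = 0.0  # accumulated squared prediction error on held-out points
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        press += np.sum((y[test_idx] - X[test_idx] @ beta) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


# Synthetic data: one variable that matters and 20 that do not.
rng = np.random.default_rng(1)
n_obs = 30
signal = rng.normal(size=(n_obs, 1))
noise_vars = rng.normal(size=(n_obs, 20))
y = 3.0 * signal[:, 0] + rng.normal(scale=0.5, size=n_obs)

cv_simple = cv_r_squared(signal, y)
cv_bloated = cv_r_squared(np.column_stack([signal, noise_vars]), y)
print(f"1 variable:   cross-validated score = {cv_simple:.2f}")
print(f"21 variables: cross-validated score = {cv_bloated:.2f}")
```

Unlike the in-sample R², this score gets worse when meaningless variables are added, so it could serve as a stopping criterion that does not restrict which variables may be combined.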
