Save the offset name in GLM and results wrapper #9100

Open
jmahlik opened this issue Dec 19, 2023 · 1 comment · May be fixed by #9130

jmahlik commented Dec 19, 2023

Is your feature request related to a problem? Please describe

After model training, it is helpful to know which variable was used as the offset. This aids post-model analysis and deployment.

The offset array is saved and can be accessed after saving the model, but the name of the offset variable is lost when it is a pandas Series. The Series is converted to a NumPy array, which drops the name. As a result, it is currently difficult to tell which variable was used as an offset without tracking it outside the model.
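
To illustrate the name loss described above, here is a minimal sketch using only pandas and NumPy. The column name `log_exposure` is made up for the example; the conversion shown is plain `np.asarray`, which is what the issue says happens to the offset.

```python
import numpy as np
import pandas as pd

# A named Series, as a user might pass for the offset (name is illustrative).
offset = pd.Series([0.1, 0.2, 0.3], name="log_exposure")

# Converting to a NumPy array keeps the values but not the Series name.
arr = np.asarray(offset)

print(offset.name)                 # the Series still knows its name
print(getattr(arr, "name", None))  # the ndarray has no name attribute
```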

Example use case: Sharing a saved model with a peer. They inspect it to determine what variable was used as the offset in training.

The same may apply to the var_weights and freq_weights for GLM.

Describe the solution you'd like

The model has access on __init__ to the name of the offset if it is a pandas series. A way to save the offset array's name if it is a series would be wonderful.

This would be similar to how the endog and exog names can be used in the model summary.

Here are a few ideas I had for how to implement this. Happy to hear if there's a better option.

  1. Add an offset_name property for GLM
  2. Add it to the model.data so it's handled by PandasData
    • The name could be added back to the offset when making the results wrapper (at least I think that's how it works)
    • I could use some guidance on how to implement this if it is the preferred approach
    • I think it has something to do with the data attrs but it's a bit hard to track down
  3. Do not convert to a numpy array if it is a series
    • One could use model.offset.name to get at the variable name
    • Doesn't line up with how the rest of the code works, which expects numpy arrays
    • Likely not a good option
  4. User adds offset_name attribute to the model class before saving it.
    • Seems like a bad idea, would like support in statsmodels

Describe alternatives you have considered

Current workaround is saving the offset name in a separate file, which is not ideal.

Additional context

Happy to work on a PR for this.

@josef-pkt
Member

I think currently 1. is the only option. 2. would be good, but currently the extra arrays do not go through the endog/exog model.data handling (at least not in most cases).

We could add a helper function to use in __init__ as a replacement for np.asarray: it would do the asarray conversion and additionally return the name of the variable if one is available.
This could also be applied to other extra data like exposure and the various weights.

Current extra data like offset, exposure, weights are 1dim.
For flexibility, the helper function could check for and distinguish 1dim and 2dim input. In the latter case, it would return the individual column names instead of the Series name.
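
The helper described above might look roughly like the following. This is only a sketch under assumptions: the function name `asarray_with_names` and its return convention (array plus name, or array plus list of column names) are invented here and are not an existing statsmodels API.

```python
import numpy as np
import pandas as pd


def asarray_with_names(data):
    """Hypothetical helper: convert to ndarray and also return any names.

    Returns (array, names) where names is the Series name for 1-dim pandas
    input, a list of column names for 2-dim pandas input, and None otherwise.
    """
    if data is None:
        return None, None
    arr = np.asarray(data)
    if isinstance(data, pd.Series):
        return arr, data.name
    if isinstance(data, pd.DataFrame):
        return arr, list(data.columns)
    # Plain lists and ndarrays carry no name information.
    return arr, None


offset, offset_name = asarray_with_names(pd.Series([1.0, 2.0], name="log_pop"))
weights, weight_names = asarray_with_names(
    pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
)
print(offset_name, weight_names)
```

The same call could then be reused for exposure and the various weights, since it does not depend on which extra array is being converted.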

The same as in GLM also applies to discrete models and likely to some other models.

jmahlik added a commit to StateFarmIns/statsmodels that referenced this issue Jan 18, 2024
Offset, exposure, freq_weights and var_weights have the name of the
series saved on the model object. They can be accessed via the class
properties.

Closes statsmodels#9100
jmahlik linked a pull request Jan 24, 2024 that will close this issue
4 tasks
josef-pkt added this to the 0.15 milestone Apr 12, 2024
jmahlik added a commit to StateFarmIns/statsmodels that referenced this issue Apr 15, 2024
Offset, exposure, freq_weights and var_weights have the name of the
series saved on the model object. They can be accessed via the class
properties.

Closes statsmodels#9100