
Sample Weight Support #12

Open
kmedved opened this issue Feb 17, 2021 · 11 comments

Labels
enhancement New feature or request

Comments

@kmedved

kmedved commented Feb 17, 2021

Hello - thanks for the wonderful package. From the writeup and the description, it seems very promising.

I wanted to check whether GPBoost supports, or will support, sample weights. I have tried both the native API and the scikit-learn API and get the following error message:

GPBoostError: Weighted data is currently not supported for the GPBoost algorithm. If this is desired, contact the developer or open a GitHub issue.

It's a bit confusing, since the API seemingly has support for sample weights, but it looks like they may just not be implemented yet? If so, are there any plans to implement them? This is key functionality in some domains, where observations may have radically different weights, and fitting an unweighted model will tend to give misleading results.

Thanks!

@fabsig
Owner

fabsig commented Feb 19, 2021

Thank you for your feedback and suggestion.

I have to admit that it is currently unclear to me how one can use sample weights in Gaussian process and random effects models, for both Gaussian and non-Gaussian data. But I have not done a thorough analysis of this. With dependent data (i.e., Gaussian process and random effects models), some things are more complicated than with the independent data assumed, e.g., in standard boosting algorithms, where you simply weight the loss / likelihood contribution of every sample. If there is a sound approach for incorporating sample weights into Gaussian process regression, we can try to add it here.

@fabsig fabsig added the enhancement New feature or request label Feb 19, 2021
@kmedved
Author

kmedved commented Feb 19, 2021

Thanks @fabsig - I apologize for my lack of familiarity with the underlying math behind Gaussian process modeling, but traditionally I see sample weights implemented by multiplying the loss (e.g. squared loss) for each row by the sample weight for that row, and then minimizing the sum of the resulting weighted losses.

Is that possible with Gaussian process modeling?
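For concreteness, here is a minimal sketch of the per-row weighting described above, for an independent-data squared loss (illustrative names only, not the GPBoost API):

```python
import numpy as np

# Per-row weighted squared loss for independent data: each sample's
# loss contribution is multiplied by its weight before summing.
def weighted_squared_loss(y_true, y_pred, sample_weight):
    residuals = y_true - y_pred
    return np.sum(sample_weight * residuals ** 2)

y = np.array([1.0, 2.0, 3.0])
pred = np.array([1.1, 1.9, 2.5])
w = np.array([1.0, 1.0, 3.0])  # third observation counts three times as much
print(weighted_squared_loss(y, pred, w))  # 0.77
```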

@fabsig
Owner

fabsig commented Feb 19, 2021

Yes, you are right. But this is not possible for Gaussian processes / random effects, as there is not one loss per sample but only one "global" loss for all samples together. I am not saying that there is no way of doing it, I just have not had time to think about it and research it in detail (e.g., for Gaussian data, one might weight the error variances accordingly, but this only works for Gaussian data...).
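To make the "global loss" point concrete: for a Gaussian likelihood, the log-marginal likelihood is one joint term that couples all samples through the covariance matrix, rather than a sum of independent per-sample losses (standard notation, with $\Sigma$ denoting the joint covariance of all $n$ samples):

$$\log p(\mathbf{y}) = -\tfrac{1}{2} (\mathbf{y} - \mathbf{f})^\top \Sigma^{-1} (\mathbf{y} - \mathbf{f}) - \tfrac{1}{2} \log \lvert \Sigma \rvert - \tfrac{n}{2} \log(2\pi)$$

Because $\Sigma$ is generally non-diagonal, the quadratic form does not decompose into per-sample terms that could simply be reweighted.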

@kmedved
Author

kmedved commented Feb 19, 2021

Understood. I took a look at how the MERF package handles this, but the solution does not seem transferable, unfortunately.

Thanks for all your work on this package - eager to see a solution if one exists.

@fabsig
Owner

fabsig commented Feb 20, 2021

Thank you for the hint. No, this is not a meaningful option. In my opinion, the option proposed in the issue you mention (allowing the user to provide weights to the random forest function) also makes no sense for the MERF algorithm. With this option, the weights are considered only in the fixed effects estimation step and ignored in the other steps of the MERF algorithm (estimation of variance parameters, estimation of random effects). And since all these steps are interconnected, it is unclear what the resulting procedure actually estimates...

Technically speaking, we could also implement something similar, but it does not make much sense. This is not a software engineering problem but a statistical problem.

@kmedved
Author

kmedved commented Feb 20, 2021

Thanks - that's helpful context.

FWIW, I would be interested in such a solution even if it is not statistically sound; it may still produce useful results in some contexts, subject to those limitations. The upside is that I don't think it should be too much work to implement (since all the scikit-learn base estimators already accept sample_weight), but I understand if you don't want to add misleading or half-baked functionality to this package.

Either way, thank you for all your efforts on this.

@jwdink

jwdink commented Jul 22, 2021

> But this is not possible for Gaussian processes / random effects, as there is not one loss per sample but only one "global" loss for all samples together.

As I understand it, for a random-effects model where the response is assumed to have Gaussian error, the response vector has a multivariate Gaussian likelihood with covariance:

$$Z S Z^\top + s^2 I_n$$

Could sample weights be implemented by replacing $I_n$ with a diagonal matrix of the inverse weights?

I'm not sure how this can be extended to non-gaussian responses (I don't quite know how GPBoost implements these), but wanted to check if this might be helpful for the gaussian case at least.
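A small numpy sketch of this proposal, assuming placeholder values for $Z$, $S$, $s^2$, and the weights (these are illustrative, not GPBoost internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 5, 2
Z = rng.normal(size=(n, q))                # random-effects design matrix
S = np.eye(q)                              # random-effects covariance
s2 = 0.5                                   # error variance
w = np.array([1.0, 1.0, 3.0, 2.0, 1.0])    # sample weights

# Unweighted covariance: Z S Z' + s^2 * I_n
cov = Z @ S @ Z.T + s2 * np.eye(n)

# Proposed weighted covariance: replace I_n with diag(1 / w),
# so higher-weight rows are assigned a smaller error variance.
cov_weighted = Z @ S @ Z.T + s2 * np.diag(1.0 / w)
```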

@fabsig
Owner

fabsig commented Jul 23, 2021

Yes, this seems like a reasonable approach for Gaussian data. That's the same approach I also mentioned in this comment:

> for Gaussian data, one might weight the error variances accordingly, but this only works for Gaussian data

I will keep this issue open with the "enhancement" label. But it is not at the top of my to-do list. Contributions are welcome.

@kmedved
Author

kmedved commented Jul 23, 2021

This would be very helpful if it is possible to add. It's unfortunately over my head to work on, in both a math and a coding sense.

@mikejacktzen

mikejacktzen commented

I have some familiarity with this topic; it's very tricky.

First, I'm assuming this discussion is all about 'probability' == 'sampling' == 'scale up' == 'representation' weights?
If unit i has weight w_i = 3, that means unit i represents 3 units in your target population.

Not to be confused with all the other meanings of 'weight':
https://notstatschat.rbind.io/2020/08/04/weights-in-statistics/

If so, this is a hard methodological and computational problem.

As alluded to in the idea proposed here:
#12 (comment)

the problem is, as @tslumley puts it, "where do you stick the probability representation weights?"

https://notstatschat.rbind.io/2018/03/13/why-pairwise-likelihood/
https://notstatschat.rbind.io/2018/04/01/svylme/
https://notstatschat.rbind.io/2018/10/19/progress-on-svy2lme/

@fabsig
Owner

fabsig commented Sep 28, 2022

Yes, using this terminology, it's about 'probability' == 'sampling' == 'scale up' == 'representation' weights. Afaik, this is the predominant way weights are used in machine learning. You want to give some observations a higher "weight" (for whatever reason, whether it's really scaling up sampling probabilities to population probabilities or simply based on heuristic arguments...).

I think the approach mentioned by @jwdink makes sense for data with a Gaussian likelihood. For independent data with a Gaussian likelihood (OLS regression, tree-boosting / random forest / neural networks for regression, etc.), dividing variances by the weights is equivalent to multiplying every log-likelihood / loss contribution by the corresponding weight. In analogy to this, you can divide the error variance / nugget effect variance by the weights in a mixed effects / GP model. This seems like a reasonable solution for "where to stick the weights". For non-Gaussian data, it is currently unclear to me how to handle weights.
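The equivalence can be written out for a single independent Gaussian observation (up to terms that do not depend on the mean $f_i$):

$$\log \mathcal{N}\!\left(y_i \mid f_i, \sigma^2 / w_i\right) = -\frac{w_i (y_i - f_i)^2}{2 \sigma^2} - \frac{1}{2} \log\!\left(2 \pi \sigma^2\right) + \frac{1}{2} \log w_i$$

Dividing the error variance by $w_i$ thus multiplies the sample's squared-error contribution by $w_i$, which matches the weighted log-likelihood / loss up to an additive constant.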

@mikejacktzen: the blog article you mention is about the use of pairwise composite likelihoods, which is an arguably related but also slightly different issue.

As said, I will keep this issue open with the "enhancement" label. But it is not at the top of my to-do list. Contributions are welcome.
