
Sample Weight Support #12

Open
kmedved opened this issue Feb 17, 2021 · 11 comments

Labels
enhancement New feature or request

Comments

@kmedved

kmedved commented Feb 17, 2021

Hello - thanks for the wonderful package. From the writeup and the description, it seems very promising.

I wanted to check whether GPBoost supports, or will support, sample weights. I have tried both the native API and the scikit-learn API and get the following error message:

GPBoostError: Weighted data is currently not supported for the GPBoost algorithm. If this is desired, contact the developer or open a GitHub issue.

It's a bit confusing, since the API seemingly has support for sample weights, but it looks like they may just not be implemented yet? If so, are there any plans to implement them? This is key functionality in some domains, where observations may have radically different weights, and fitting an unweighted model will tend to give misleading results.

Thanks!

@fabsig
Owner

fabsig commented Feb 19, 2021

Thank you for your feedback and suggestion.

I have to admit that it is currently unclear to me how one can use sample weights in Gaussian process and random effects models, for both Gaussian and non-Gaussian data. But I have not done a thorough analysis of this. With dependent data (i.e., Gaussian process and random effects models), some things are more complicated than with the independent data assumed, e.g., in standard boosting algorithms, where you simply weight the loss / likelihood contribution of every sample. If there is a sound approach for incorporating sample weights into Gaussian process regression, we can try to add it here.

@fabsig fabsig added the enhancement New feature or request label Feb 19, 2021
@kmedved
Author

kmedved commented Feb 19, 2021

Thanks @fabsig - I apologize for my lack of familiarity with the underlying math behind Gaussian process modeling, but traditionally I see sample weights implemented by multiplying the loss (e.g. squared loss) for each row by the sample weight for that row, and then minimizing the sum of the resulting weighted losses.

Is that possible with Gaussian process modeling?
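For concreteness, here is a minimal sketch of the per-row weighting described above, for an independent-data squared loss (illustrative names only, not the GPBoost API):

```python
import numpy as np

# Per-row weighted squared loss for independent data: each sample's
# loss contribution is multiplied by its weight before summing.
def weighted_squared_loss(y_true, y_pred, sample_weight):
    residuals = y_true - y_pred
    return np.sum(sample_weight * residuals ** 2)

y = np.array([1.0, 2.0, 3.0])
pred = np.array([1.1, 1.9, 2.5])
w = np.array([1.0, 1.0, 3.0])  # third observation counts three times as much
print(weighted_squared_loss(y, pred, w))  # 0.77
```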

@fabsig
Owner

fabsig commented Feb 19, 2021

Yes, you are right. But this is not possible for Gaussian processes / random effects, as there is not one loss per sample but only one "global" loss for all samples together. I am not saying that there is no way of doing it, I just have not had time to think about it and research it in detail (e.g., for Gaussian data, one might weight the error variances accordingly, but this only works for Gaussian data...).
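To make the "global loss" point concrete: for a Gaussian likelihood, the log-marginal likelihood is one joint term that couples all samples through the covariance matrix, rather than a sum of independent per-sample losses (standard notation, with $\Sigma$ denoting the joint covariance of all $n$ samples):

$$\log p(\mathbf{y}) = -\tfrac{1}{2} (\mathbf{y} - \mathbf{f})^\top \Sigma^{-1} (\mathbf{y} - \mathbf{f}) - \tfrac{1}{2} \log \lvert \Sigma \rvert - \tfrac{n}{2} \log(2\pi)$$

Because $\Sigma$ is generally non-diagonal, the quadratic form does not decompose into per-sample terms that could simply be reweighted.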

@kmedved
Author

kmedved commented Feb 19, 2021

Understood. I took a look at how the MERF package handles this, but the solution does not seem transferable, unfortunately.

Thanks for all your work on this package - eager to see a solution if one exists.

@fabsig
Owner

fabsig commented Feb 20, 2021

Thank you for the hint. No, this is not a meaningful option. In my opinion, the option proposed in the issue you mention (allowing the user to provide weights to the random forest function) also makes no sense for the MERF algorithm. With this option, the weights are considered only in the fixed effects estimation step and ignored in the other steps of the MERF algorithm (estimation of variance parameters, estimation of random effects). And since all these steps are interconnected, it is unclear what the resulting procedure actually estimates...

Technically speaking, we could also implement something similar, but it does not make much sense. This is not a software engineering problem but a statistical problem.

@kmedved
Author

kmedved commented Feb 20, 2021

Thanks - that's helpful context.

FWIW, I would be interested in such a solution even if it is not statistically sound; it may still produce useful results in some contexts, subject to those limitations. The upside is that I don't think it should be too much work to implement (since all the scikit-learn base estimators already accept sample_weight), but I understand if you don't want to add misleading or half-baked functionality to this package.

Either way, thank you for all your efforts on this.

@jwdink

jwdink commented Jul 22, 2021

> But this is not possible for Gaussian processes / random effects, as there is not one loss per sample but only one "global" loss for all samples together.

As I understand it, for a random-effects model where the response is assumed to have Gaussian error, the response vector has a multivariate Gaussian likelihood with covariance:

$$Z S Z^\top + s^2 I_n$$

Could sample weights be implemented by replacing $I_n$ with a diagonal matrix of the inverse weights?

I'm not sure how this can be extended to non-gaussian responses (I don't quite know how GPBoost implements these), but wanted to check if this might be helpful for the gaussian case at least.
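A small numpy sketch of this proposal, assuming placeholder values for $Z$, $S$, $s^2$, and the weights (these are illustrative, not GPBoost internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 5, 2
Z = rng.normal(size=(n, q))                # random-effects design matrix
S = np.eye(q)                              # random-effects covariance
s2 = 0.5                                   # error variance
w = np.array([1.0, 1.0, 3.0, 2.0, 1.0])    # sample weights

# Unweighted covariance: Z S Z' + s^2 * I_n
cov = Z @ S @ Z.T + s2 * np.eye(n)

# Proposed weighted covariance: replace I_n with diag(1 / w),
# so higher-weight rows are assigned a smaller error variance.
cov_weighted = Z @ S @ Z.T + s2 * np.diag(1.0 / w)
```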

@fabsig
Owner

fabsig commented Jul 23, 2021

Yes, this seems like a reasonable approach for Gaussian data. That's the same approach I also mentioned in this comment:

> for Gaussian data, one might weight the error variances accordingly, but this only works for Gaussian data

I will keep this issue open with the "enhancement" label. But it is not at the top of my to-do list. Contributions are welcome.

@kmedved
Author

kmedved commented Jul 23, 2021

This would be very helpful if it is possible to add. It's unfortunately over my head to work on, in both a math and a coding sense.

@mikejacktzen

mikejacktzen commented

I have some familiarity with this topic; it's very tricky.

First, I'm assuming this discussion is all about 'probability' == 'sampling' == 'scale up' == 'representation' weights?
If unit i has weight w_i = 3, that means unit i represents 3 units in your target population.

Not to be confused with all the other meanings of 'weight':
https://notstatschat.rbind.io/2020/08/04/weights-in-statistics/

If so, this is a hard methodological and computational problem.

As alluded to in the idea proposed here:
#12 (comment)

the problem is, as @tslumley puts it, "where do you stick the probability representation weights?"

https://notstatschat.rbind.io/2018/03/13/why-pairwise-likelihood/
https://notstatschat.rbind.io/2018/04/01/svylme/
https://notstatschat.rbind.io/2018/10/19/progress-on-svy2lme/

@fabsig
Owner

fabsig commented Sep 28, 2022

Yes, using this terminology, it's about 'probability' == 'sampling' == 'scale up' == 'representation' weights. Afaik, this is the predominant way weights are used in machine learning. You want to give some observations a higher "weight" (for whatever reason, whether it's really scaling up sampling probabilities to population probabilities or simply based on heuristic arguments...).

I think the approach mentioned by @jwdink makes sense for data with a Gaussian likelihood. For independent data with a Gaussian likelihood (OLS regression, tree-boosting / random forest / neural networks for regression, etc.), dividing variances by the weights is equivalent to multiplying every log-likelihood / loss contribution by the corresponding weight. In analogy to this, you can divide the error variance / nugget effect variance by the weights in a mixed effects / GP model. This seems like a reasonable solution for "where to stick the weights". For non-Gaussian data, it is currently unclear to me how to handle weights.
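The equivalence can be written out for a single independent Gaussian observation (up to terms that do not depend on the mean $f_i$):

$$\log \mathcal{N}\!\left(y_i \mid f_i, \sigma^2 / w_i\right) = -\frac{w_i (y_i - f_i)^2}{2 \sigma^2} - \frac{1}{2} \log\!\left(2 \pi \sigma^2\right) + \frac{1}{2} \log w_i$$

Dividing the error variance by $w_i$ thus multiplies the sample's squared-error contribution by $w_i$, which matches the weighted log-likelihood / loss up to an additive constant.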

@mikejacktzen: the blog article you mention is about the use of pairwise composite likelihoods, which is an arguably related but also slightly different issue.

As said, I will keep this issue open with the "enhancement" label. But it is not at the top of my to-do list. Contributions are welcome.
