proportion outcome between [0,1] #75

Open
mikejacktzen opened this issue Sep 26, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@mikejacktzen commented Sep 26, 2022

Hi, @fabsig thank you for your work, this sounds like an exciting method.

IIRC, you currently only support 0/1 binary outcomes with a logistic link (based on a ctrl+F search for 'logit' in the likelihoods header):

https://github.com/fabsig/GPBoost/blob/18f32760ac8617e3db65e1b0993fc4f00c1017be/include/GPBoost/likelihoods.h

Is there a small modification that would enable gpboost to run on proportion outcomes in [0,1]?

# in R

library(gpboost)

X = matrix(rnorm(2*100), ncol=2)
b = c(2, -2)
eta = X %*% b - 1
p = plogis(eta)
n = rep(10, length(p))
# Simulate y wins in n games
y = rbinom(100, n, p)

# convert the pair (y, n) into a proportion in [0,1]
outcome_prop = y/n
group_data = sample(letters[1:4], length(p), replace=TRUE)

gp_model <- fitGPModel(group_data=group_data, likelihood="bernoulli_logit", y=outcome_prop, X=X)

I get the obvious error

Error in gpb.call("GPB_OptimLinRegrCoefCovPar_R", ret = NULL, private$handle, :
[GPBoost] [Fatal] Response variable (label) data needs to be 0 or 1 for likelihood of type 'bernoulli_logit'.

I was hoping it would work similarly to xgboost when using the 'reg:logistic' objective
https://datascience.stackexchange.com/questions/10595/difference-between-logistic-regression-and-binary-logistic-regression


library(xgboost)
# ?xgboost
# https://datascience.stackexchange.com/questions/10595/difference-between-logistic-regression-and-binary-logistic-regression

# objective = "reg:logistic"
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2, objective = "reg:logistic")

# convert pair (y,n) into scalar proportion (y/n)
outcome_xgb = y/n
dtrain <- xgb.DMatrix(X, label = outcome_xgb)
dtest <- xgb.DMatrix(X, label = outcome_xgb)
watchlist <- list(train = dtrain, eval = dtest)

bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
pred <- predict(bst, dtrain)
head(pred)

This feature would solve a special use case and I think would be helpful in other scenarios as well.

I came across the post below and hope this is not a big ask, but I understand if there is hidden complexity that I cannot foresee:

#7

@fabsig (Owner) commented Sep 27, 2022

@mikejacktzen Thank you for your suggestion.

It seems that this is a useful add-on that can be implemented with manageable effort. May I ask, what is the approach of 'reg:logistic' in XGBoost? Is it the quasi-likelihood approach of Papke and Wooldridge (1996) (Equation (5))?
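For reference, Equation (5) there amounts to the Bernoulli log-likelihood evaluated at fractional responses; a minimal R sketch of that criterion (for illustration only, assuming a logistic mean function G):

# Bernoulli quasi log-likelihood for fractional y in [0,1]
# (Papke & Wooldridge 1996, Eq. (5)); G is taken to be the logistic cdf here
quasi_loglik <- function(beta, X, y) {
  p <- plogis(as.vector(X %*% beta))       # G(x_i' beta)
  sum(y * log(p) + (1 - y) * log(1 - p))   # same form as the 0/1 log-likelihood,
                                           # but y may take any value in [0,1]
}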

fabsig added the enhancement label Sep 27, 2022
@mikejacktzen (Author) commented Sep 30, 2022

That sounds right. Is something similar to the quasi-likelihood already what lightgbm uses? If so, that's even better, time- and coding-wise.

I'm hoping this request does not have to define a whole new objective, and instead can modify the existing objective.

Perhaps some of the safety checks that currently enforce 0/1 labels could be relaxed to allow [0,1] values with the logistic (quasi-)likelihood, if it's that easy. Using variants of the logistic loss would be perfect.

I did some more digging around on XGBoost; it sounds like it does not make a substantive distinction between 'binary:logistic' and 'reg:logistic' when it comes to the loss objective during training.

https://datascience.stackexchange.com/questions/9802/what-is-the-difference-in-xgboost-binarylogistic-and-reglogistic

As another example, the glmnet package in R supports both 0/1 and [0,1] responses using the same binomial(N=1, p) logistic likelihood internally.
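For what it's worth, the same behaviour can be reproduced with base R's glm (a rough sketch for illustration, not glmnet's internals): the standard binomial machinery accepts proportions in [0,1], either via the quasi-binomial family or with known trial counts passed as prior weights.

# illustration only: fractional responses in [0,1] with base R glm()
X = matrix(rnorm(2*100), ncol=2)
p = plogis(as.vector(X %*% c(2, -2)) - 1)
n = rep(10, 100)
y_prop = rbinom(100, n, p) / n                 # proportions in [0,1]

# quasi-binomial: logistic score equations, no 0/1 restriction on the response
fit_quasi = glm(y_prop ~ X, family = quasibinomial())

# proper binomial likelihood when the trial counts n are known (passed as weights)
fit_binom = glm(y_prop ~ X, family = binomial(), weights = n)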

@mikejacktzen (Author) commented Oct 2, 2022

I think I should have started my hunt with lightgbm examples!

It looks like lightgbm supports a cross-entropy objective that allows [0,1] labels, with a loss similar to the binomial(N=1, p) ~= quasi-likelihood:

https://lightgbm.readthedocs.io/en/latest/Parameters.html#core-parameters

Maybe enabling cross-entropy in gpboost would be the easiest route?

# in R

library(lightgbm)

X = matrix(rnorm(2*100), ncol=2)
b = c(2, -2)
eta = X %*% b - 1
p = plogis(eta)
n = rep(10, length(p))
# Simulate y wins in n games
y = rbinom(100, n, p)

# https://github.com/Microsoft/LightGBM/issues/1348

# 0/1 logistic logloss: the "binary" objective requires 0/1 labels,
# so use single-trial outcomes here
y01 = rbinom(100, 1, p)
train_matrix = lgb.Dataset(data = X, label = y01)
val_matrix = lgb.Dataset(data = X, label = y01)
valid = list(test = val_matrix)

params = list(max_bin = 5,
              learning_rate = 0.001,
              objective = "binary",
              metric = 'binary_logloss')
bst = lgb.train(params = params, data = train_matrix, valids = valid, nrounds = 10)

# cross-entropy: labels may lie anywhere in [0,1]

outcome_prop = y/n
train_matrix = lgb.Dataset(data = X, label = outcome_prop)
val_matrix = lgb.Dataset(data = X, label = outcome_prop)
valid = list(test = val_matrix)

params = list(max_bin = 5,
              learning_rate = 0.001,
              objective = "cross_entropy")
bst = lgb.train(params = params, data = train_matrix, valids = valid, nrounds = 10)

@fabsig (Owner) commented Oct 3, 2022

Thanks for the update. GPBoost cannot simply reuse LightGBM losses, as the computations are very different, but it should still not be too much effort. I plan to implement this once I have time.

@fabsig (Owner) commented Oct 25, 2022

@mikejacktzen: Sorry for the long silence. I finally started to have a look at this. It turns out that things are slightly more complicated than I first assumed. While simply implementing a Bernoulli quasi log-likelihood function, such as what LightGBM calls cross_entropy 🤨, is relatively simple from a software engineering point of view, it is currently unclear to me how to implement this in a sound way for models with random effects without doing something stupid from a statistical point of view. For instance, robust standard errors via a sandwich estimator need to be used with a Bernoulli quasi log-likelihood function. See, e.g., this discussion and this one for more details in the GLMM case (yes, the GPBoost algorithm does not come with standard errors for the fixed effects, but the GPBoost library also allows for GLMMs, and it calculates approximate standard errors of (co-)variance parameters). It is currently not clear to me whether, e.g., lme4 even allows for a Bernoulli quasi log-likelihood (not a binomial one) for fractional response data in a GLMM context. Besides, a Bernoulli quasi log-likelihood function is only one of several ways of modeling fractional response data; others include distributions with support on [0,1] and the binomial distribution if the "number of trials" per observation is known.
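For comparison, the last option (known trial counts) already works in standard GLMM software; a minimal sketch with lme4, for illustration only and not GPBoost code:

# illustration only (lme4, not GPBoost): a binomial GLMM when the number of
# trials n_i per observation is known -- a proper likelihood, so the
# quasi-likelihood / sandwich-estimator issues above do not arise
library(lme4)
X = matrix(rnorm(2*100), ncol=2)
n = rep(10, 100)
y = rbinom(100, n, plogis(as.vector(X %*% c(2, -2)) - 1))
group = sample(letters[1:4], 100, replace=TRUE)
fit = glmer(cbind(y, n - y) ~ X + (1 | group), family = binomial())
summary(fit)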

Having said this, I have spent relatively little time on the topic. To implement something about which I am confident that it makes sense, I would need to spend more time on this. If, e.g., a collaboration for an article is an option for you, I would be willing to invest time into this. Let me know. Otherwise, this is currently on hold for me. It goes without saying that contributions are very welcome, but they should also come with a sound justification of what (and why) is being done.

@mikejacktzen (Author) commented:
Thanks for taking a look, Fabio. I've come to realize that the things that seem easy are usually the hardest to achieve. I'll follow up offline by email, tying together some of the feature requests (including this one).
