
feature request: allow user to pass in a vector of values for the residual error in the gaussian outcomes case #76

Open
mikejacktzen opened this issue Sep 27, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@mikejacktzen

hi @fabsig

From the readme,

https://github.com/fabsig/GPBoost#gpboost-and-lagaboost-algorithms

Focusing on the specific case of Gaussian outcomes, and using the convention

y = F(X) + Zb + xi

where xi is an independent error term and X are predictor variables (aka covariates or features).

Is it possible to enable a feature where the user could supply the values of xi? Then xi would not be estimated; instead, the user would pass in a vector of values, e.g. 'xi = c(1, ...., 10)', whose length equals the length of the outcome,

as,

# in R

gp_model <- fitGPModel(group_data=group_data, 
               xi=c(1,3,4, ... , 10) ,
               likelihood="gaussian",
               y=outcome_xgb, X=X)

If you think this is doable, I could elaborate further on the motivation.

Thank you.


fabsig commented Sep 27, 2022

@mikejacktzen: Thank you for your suggestion.

It seems to me that the easiest way to do this would be to

  1. subtract \xi from y and pass y' = y - \xi as the "new y" to GPBoost, and
  2. set the error variance \sigma^2 to a very low value and not estimate it.

Unfortunately, the second point (telling GPBoost to hold the error variance fixed and not estimate it) is currently not possible. This would require some software engineering. I might implement it in the future, but it is not at the top of my priority list. Contributions are welcome.

As an alternative, you might try my above-suggested approach (1.) without (2.) and check how large the estimated error variance is. Note that the \xi values themselves are not estimated anyway.
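Approach (1.) without (2.) could look roughly like this in R. This is a sketch: the variable names (xi_known, outcome_xgb, group_data, X) are illustrative placeholders, and only the fitGPModel arguments already used in this thread appear.

```r
# Sketch of approach (1.) without (2.): subtract the known xi from y and
# fit the model on the shifted outcome. All data objects are placeholders.
library(gpboost)

y_new <- outcome_xgb - xi_known          # y' = y - xi, the "new y"

gp_model <- fitGPModel(group_data = group_data,
                       likelihood = "gaussian",
                       y = y_new, X = X)

# Then check how small the estimated error variance turns out to be:
summary(gp_model)
```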

Note on the above-suggested approach: even if you assume that xi is known exactly and that the only source of randomness are the random effects Zb, you usually need to add a small number to the diagonal of the random effects covariance matrix in order to ensure that the covariance matrix is "numerically" strictly positive definite and can be inverted with the Cholesky decomposition without numerical problems. And this approach is equivalent to setting the error variance to a very small known value.
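The diagonal "nugget" described in the note above can be illustrated in plain base R. The matrix below is an arbitrary made-up example, not GPBoost internals: without the jitter the covariance matrix is singular and chol() fails; with it, the factorization succeeds.

```r
# Illustration of the note above: add a small value to the diagonal of a
# covariance matrix so the Cholesky factorization succeeds numerically.
Z <- matrix(c(1, 1, 0, 0,
              0, 0, 1, 1), nrow = 4)      # toy random-effects design matrix
K <- Z %*% tcrossprod(diag(2), Z)          # K = Z %*% t(Z): rank-deficient, chol(K) errors

jitter <- 1e-6                             # plays the role of a tiny known error variance
K_jittered <- K + jitter * diag(nrow(K))   # now strictly positive definite

L <- chol(K_jittered)                      # succeeds without numerical problems
```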

fabsig added the enhancement (New feature or request) label on Sep 27, 2022

mikejacktzen commented Sep 27, 2022

Thanks for the thoughtful reply. I understand your points. Those numerical stability checks are always a fun surprise.

I think I rushed my original post (excited about the possibility). Instead of editing it (which might confuse other readers by the time they reach your reply), I'll continue here. My rushed OP confused xi (the residual terms) with the covariance of its distribution.

What I really mean is: for the normal outcomes case only, let the user provide the covariance matrix D of the residual errors xi,

y = F(X) + Zb + xi
xi ~ Normal(0,D)

  • I would only use this for the special case where D is the diagonal matrix of a user-supplied vector s_xi,
    D = diagonal(s_xi) = s_xi .* I

  • The most general case is to let the user provide an arbitrary D. This puts the burden on the user to construct D and do their own positive semi-definite checks (I think a good thing), so you don't get flooded with requests for every new variant structure of D (like I'm doing here).

Maybe this is the limitation you mention in point (2.) of your reply above, which sounds like what I need.

gp_model <- fitGPModel(group_data=group_data, 
               s_xi=c(1,3,4, ... , 10) ,  # if you restrict to the special case in the underlying C code
               likelihood="gaussian",
               y=outcome_xgb, X=X)

gp_model <- fitGPModel(group_data=group_data, 
               D=diag(s_xi),  # if you let the user pass in any D
               likelihood="gaussian",
               y=outcome_xgb, X=X)

# Should only work for the Gaussian case
# if (!is.null(D) && likelihood != "gaussian") stop("D is only supported for likelihood = 'gaussian'")
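The special case described in the first bullet above can be built and validated in plain base R. The vector s_xi below is an illustrative user-supplied vector of per-observation error variances, not part of any existing GPBoost API.

```r
# Building the proposed diagonal covariance D = diag(s_xi) = s_xi .* I and
# checking the positivity the proposal requires. s_xi is illustrative.
s_xi <- c(1, 3, 4, 2, 10)

stopifnot(all(s_xi > 0))     # D = diag(s_xi) is positive definite iff all entries are > 0

D <- diag(s_xi)              # the special-case covariance matrix of xi
```

With D diagonal and strictly positive, the positive-definiteness check the second bullet worries about is trivial, which is what makes this special case attractive as a first feature.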

With this new corrected detail, do your points still apply?

When xi ~ Normal(0,D) ,

Are the GPBoost internal algorithms crucially reliant on working with D = scalar .* Identity (which is how I understand the current setup)?

If not, I hope an easy next feature upgrade would allow D = vector .* Identity, where the vector has only positive entries.

If so, I can continue this via email with some very important use-case contexts that I think are ripe for co-authoring papers.


fabsig commented Sep 28, 2022

Thanks for the clarification. Providing the covariance matrix of the error term instead of the error term itself makes much more sense. This could be implemented but requires some software engineering. I would be willing to work on this if the goal is a joint publication in a collaboration. Just write me an email. You can find my email address on my public university homepage.
