
confint with sparse factor levels breaks #49

Open
hofnerb opened this issue Aug 25, 2016 · 4 comments


hofnerb commented Aug 25, 2016

see #47

@hofnerb hofnerb added the bug label Aug 25, 2016
@hofnerb hofnerb self-assigned this Aug 25, 2016

davidruegamer commented Aug 25, 2016

Regarding your comment in #47: my use case is only partly related to mboost, as we computed bootstrap intervals for functional effects fitted via FDboost, and using confint directly would actually be too complicated.

If you construct CIs for factor variables, droplevels might be a problem; yet not using droplevels is a problem as well. The question is: what does it actually mean if a level was dropped? Is it equal to the level being estimated as zero? As I use predictions, this should be somehow manageable. What were your considerations [...]?
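To make that concrete, here is a minimal sketch in plain R (not mboost internals; the seed is only for reproducibility): a level observed only once can easily be absent from a bootstrap resample, and after droplevels() the refitted model has no coefficient for it at all.

## sketch: a sparse level vanishing from a bootstrap resample
set.seed(1)
z <- factor(c(sample(1:5, 100, replace = TRUE), 6), levels = 1:6)
idx <- sample(seq_along(z), replace = TRUE)   # one nonparametric resample
z_boot <- droplevels(z[idx])                  # unobserved levels are dropped
setdiff(levels(z), levels(z_boot))            # levels without a coefficient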

In my case, resampling was done on the level of correlated observations, i.e. on the subject level, with each subject having gone through every possible study setting. So I actually did not have to deal with sparse factor levels (and dropping levels should be fine for random effects?).
But in general, if there happen to be unfilled categories in far more than one sample, I would throw an error. In this case, I would say, the problem comes down to insufficiently informative data for mimicking the true distribution $F_{Y,X}$ and is therefore not a problem of mboost. I'm not quite sure what exactly happens in confint at the moment when there are unfilled levels, but if there are just a handful of samples with unfilled categories, I still would not set the estimate to zero. I would rather change the behaviour of .ci_mboost to calculate the intervals for this specific factor (level) only on the basis of those samples in which all levels of the factor variable are present, as sketched below.
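A minimal sketch of that idea, assuming boot_coefs is a B x K matrix of per-resample coefficient estimates with NA wherever a level was unobserved (the name and layout are hypothetical, not the actual .ci_mboost internals):

ci_complete_only <- function(boot_coefs, level = 0.95) {
  ## keep only resamples in which every level of the factor was present
  complete <- stats::complete.cases(boot_coefs)
  alpha <- 1 - level
  ## column-wise percentile intervals from the complete resamples only
  apply(boot_coefs[complete, , drop = FALSE], 2,
        stats::quantile, probs = c(alpha / 2, 1 - alpha / 2))
}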


hofnerb commented Aug 25, 2016

Just to understand you correctly: you were computing CIs for fixed effects and were not interested in the random effects? In that case I would agree that dropping unused levels should not pose any problem.

Regarding the second part of your answer, I have to rethink this. In a parametric setting, the CI would get rather large in that case, as the standard error gets large.

With setting the estimate to zero I meant only the estimate on the current fold, which then becomes the basis for the CI. However, you are right that this isn't correct either. Currently the code just breaks. Perhaps we keep this behavior and simply throw a more informative error to let the user know that sparse categories hamper the computation of bootstrap CIs; a sketch follows below. Well, I have to check this in a small simulation...
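As a hedged sketch of such a check (the function name and message are illustrative, not existing mboost code):

check_fold_levels <- function(z_full, z_fold) {
  ## levels of the full-data factor that are unobserved in the current fold
  missing_lv <- setdiff(levels(z_full), as.character(unique(z_fold)))
  if (length(missing_lv) > 0)
    stop("factor level(s) ", paste(missing_lv, collapse = ", "),
         " unobserved in this bootstrap fold; sparse categories hamper ",
         "the computation of bootstrap confidence intervals", call. = FALSE)
  invisible(TRUE)
}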


davidruegamer commented Aug 25, 2016

> Just to understand you correctly: you were computing CIs for fixed effects and were not interested in the random effects? In that case I would agree that dropping unused levels should not pose any problem.

Yes, exactly. Thanks for the response!

> With setting the estimate to zero I meant only the estimate on the current fold, which then becomes the basis for the CI.

So did I. But I think precisely this procedure is problematic. For example, think about a model for the probability of suffering a stroke (yes / no). If there is a factor variable "suffered_stroke_before", which is zero / FALSE for most observations but highly predictive of "yes" if one / TRUE, you certainly do not want to set the effect to zero for a large number of folds (though the corresponding confidence interval would probably just touch and not cross zero).

> Perhaps we keep this behavior and simply throw a more informative error to let the user know that sparse categories hamper the computation of bootstrap CIs.

It's probably for the best. I would even go so far as to say that CIs based on bootstrapped (shrunken) boosting coefficients are a feature for advanced users (who are aware of the origin of those intervals), and throwing an error is in line with the actual purpose of the function: rather an "I'm aware that the intervals do not necessarily comply with the nominal level and are biased due to the shrinkage" function than a black-box interval function that always returns something.


hofnerb commented Nov 22, 2016

Start for test:

### check confidence intervals for factors with very small level frequencies
library("mboost")
## level 6 is observed exactly once among n = 101 observations
z <- factor(c(sample(1:5, 100, replace = TRUE), 6), levels = 1:6)
y <- rnorm(101)
mod <- mboost(y ~ bols(z))
confint(mod)  # currently breaks because level 6 is missing in many folds

(to be added to tests/regtest-inference.R)
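For reference, with a single observation of level 6 the probability that a nonparametric bootstrap resample of size n misses that observation entirely is (1 - 1/n)^n, which is approximately exp(-1), so roughly a third of the folds should trigger the problem:

n <- 101
(1 - 1/n)^n   # ~0.366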
