
Doing some treatment for missing values #18

Open

p9anand opened this issue May 31, 2019 · 13 comments

Labels: enhancement (New feature or request)

p9anand commented May 31, 2019

Can we add some random values or treat them as "null" for missing data? That way we can avoid errors during data exploration using:
hist = ClassHistogram().explain_data(X_train, y_train, name='TrainData')
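
For reference, a minimal sketch of the failing setup (the toy data and column names here are hypothetical):

import numpy as np
import pandas as pd
from interpret import show
from interpret.data import ClassHistogram

# Hypothetical training data containing NaNs
X_train = pd.DataFrame({
    "age": [25, np.nan, 47, 31],
    "income": [50000, 62000, np.nan, 48000],
})
y_train = np.array([0, 1, 1, 0])

# Errors out today because of the NaNs in X_train
hist = ClassHistogram().explain_data(X_train, y_train, name='TrainData')
show(hist)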

interpret-ml (Collaborator) commented Jun 3, 2019

Hi @p9anand - thanks for bringing this up!

We're working on missing values at the moment.
The approach we're following is to treat missing as its own special value, and visualize it appropriately. We'd seen some strange problems with imputation in the past, and we're hoping to avoid them this way.

interpret-ml added the enhancement (New feature or request) label on Jun 8, 2019
timvink commented Feb 28, 2020

Exciting implementation, and a promising algorithm when intelligibility is as important as accuracy!

Can you share how you are planning to implement missing value support?

The paper does not mention missing values. As I understand it, GA2M is fitted not with splines but with regression trees.

Then I could see two approaches:

  1. In XGBoost, missing values are supported by learning a default branch direction during training (unless using the gblinear booster). The same could be done for the univariate and bivariate trees.

  2. One-hot encode the missing value (e.g. with sklearn.impute.MissingIndicator) and learn a beta coefficient for it (see the sketch below).

I'm also interested in ideas on how to put missing value information back into the visualizations.
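
A minimal sketch of option 2, using scikit-learn's MissingIndicator to derive the extra indicator columns (the zero-imputation of the original NaNs here is just a placeholder):

import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[1.0, np.nan],
              [np.nan, 3.0],
              [4.0, 5.0]])

# One binary indicator column per feature that contains missing values
indicator = MissingIndicator(features="missing-only")
mask = indicator.fit_transform(X)  # True where a value was missing

# Append the indicator columns so the model can learn a separate
# coefficient for "missingness" per feature
X_augmented = np.hstack([np.nan_to_num(X), mask.astype(float)])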

antonkulaga commented Aug 4, 2020

I also suffer from the lack of missing value support in ExplainableBoostingRegressor. I deal with cross-species gene expression datasets with many NAs :_(

stefanhgm commented Nov 5, 2020

It is pretty straightforward to integrate missing values for categorical variables by simply adding another category. For continuous variables I tried a very simple workaround (a slight modification of the EBM code): encoding missing as "minimum value - 1", so that it's smaller than all regular values and treated separately during training. However, the EBM algorithm often seems to ignore these outlier values even when they occur very frequently, and missing ends up getting the same risk as the minimum value. Hence, it is not treated as the special value it is. Any ideas on how to make this simple modification work for continuous values? Or is there any progress on a native implementation of missing values for the EBM algorithm? The encoding half of my workaround looks roughly like the sketch below.
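
A sketch of the input-side encoding only (the rest of the workaround lived inside the EBM code):

import numpy as np
import pandas as pd

def encode_missing_as_outlier(X: pd.DataFrame) -> pd.DataFrame:
    # Replace NaNs in each numeric column with (column minimum - 1),
    # so missing values land in their own leftmost bin
    X = X.copy()
    for col in X.select_dtypes(include=[np.number]).columns:
        if X[col].isna().any():
            X[col] = X[col].fillna(X[col].min() - 1)
    return X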

Thanks!

interpret-ml (Collaborator) commented Feb 4, 2021

Hi @p9anand, @timvink, @antonkulaga, @stefanhgm -

Our latest release includes some work to handle missing values better. To enable it, though, you'd need to modify the code in ebm.py by changing every place it has "missing_data_allowed=False" into "missing_data_allowed=True", and then it should work. We didn't include this in the release since our graphing code still needs to be updated to handle missing values, but the underlying core framework should handle them now. The graphing code should still function, but it won't show the missing value bin that gets created. If you want to see the missing value score today, you would need to check the additive_terms_ field, which has the missing value score at index 0.
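
For example (a sketch; attribute names follow the current release and may change):

from interpret.glassbox import ExplainableBoostingClassifier

# Assumes ebm.py has been modified as described above,
# and X_train/y_train are your training data
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

# The score learned for the missing value bin sits at index 0 of each
# term (for pairs it is the first row/column of the matrix)
for name, scores in zip(ebm.feature_names, ebm.additive_terms_):
    print(name, "missing-bin score:", scores[0])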

The change needed to re-enable missing value support would be to do the opposite of:
f298c73

The current implementation puts all the missing values into their own bin on the left side. We do plan to improve this and implement the XGBoost method, where the missing values are merged into the side that improves gain the most on each boosting step.

-InterpretML team

stefanhgm commented

Hi.

Thanks for coming back to this issue and for your detailed explanation! It works as described for me and, unsurprisingly, produces very similar results to the workaround I mentioned earlier. Just as a clarification: as far as I can see, you create an extra bin at position [0] (1D) or [0,...]/[...,0] (2D) that can receive a non-zero weight during training even if there are no missing values? And is there any indicator in EBM that tells me whether a variable has missing values? Otherwise, I could also infer that from the data myself.

Implementation of the XGBoost method would be awesome. Is that something that will be accomplished in the next few weeks, or will it take more time? Just wondering for a current project. For visualization, we used the aforementioned workaround for missing values together with a custom plotting implementation in a recent project, where we reserved 10% of the axes for a visually separated unknown bin (see the Appendix in https://bit.ly/39RcYex).

interpret-ml (Collaborator) commented Feb 7, 2021

Hi @stefanhgm --

Thanks for sending us a link to your paper. It's great to see new research in the EBMs/GA2Ms visualization space!

You can find out if there were missing values in the dataset by looking at the field ebm.preprocessor_.col_bin_counts_[feature_index][0], which should contain the count of missing values observed.
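
For instance (a sketch; feature_index is whichever feature you want to inspect):

# Count of missing values seen during training for one feature
missing_count = ebm.preprocessor_.col_bin_counts_[feature_index][0]
if missing_count > 0:
    print(f"feature {feature_index}: {missing_count} missing values in training")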

Yes, the resulting models should look similar to how they would look if missing values were given an extreme outlier value, with the additional benefit that missing values are now guaranteed to be in their own bin and never merged with real data when there are too few missing values to fill a bin on their own. If you have a dataset with no missing values for a feature, then the resulting model should have logits of zero in the missing value bin. If you are seeing something other than zero, then we need to figure that out. Boosting will create an illusory value in the missing bin even if there is no data there, but we force that value to zero in post-processing.

For 1D, your representation is correct. To be a little clearer on how this works for pairs, if before you had two binary features and the following matrix for the pair in the model:
[0.1 0.2]
[0.3 0.4]

What you would now get with the missing values change is a 3rd missing bin on each dimension, so a 3x3 matrix that looks like this:

[0.0 0.0 0.0]
[0.0 0.1 0.2]
[0.0 0.3 0.4]

An interesting aspect is that there is now a bin for the case where both features are missing, at the [0, 0] location. This missing value handling can extend to higher dimensions once we start supporting 3-way and higher tensor interactions.

IMHO, missing value handling is going to be one of the more interesting aspects of the EBM model class. One important aspect is that we'll likely be able to make reasonable predictions in scenarios where the training data has no missing values, but they occur during prediction. So, for instance, if you had a sensor of some kind that was working while the training data was collected, but which failed at a later date, EBMs will still be able to use the other available features to make reasonable predictions. Each feature is independently mean centered, so if a feature is missing, a reasonable choice is to simply set its contribution to the expected value of 0. Of course, you could do even better by retraining the model if there is a change in a feature definition like this, but we expect that setting it to zero will be a reasonable choice in many applications for handling unexpected changes in the data.
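
Conceptually (a sketch, not the library's actual prediction code):

# An EBM logit is an intercept plus one lookup per term; a feature that
# is missing at predict time falls into the missing bin, whose score is
# 0 when no missing values were seen in training (terms are mean centered)
def ebm_logit(term_scores, bin_indices, intercept):
    # bin_indices[i] == 0 means feature i is missing
    return intercept + sum(s[i] for s, i in zip(term_scores, bin_indices))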

The other interesting aspect is that EBMs should make it visually clear whether the missing values carry any information by virtue of being missing. If the missing values have a signal, as they often do in medicine where diagnostic tests are not ordered when they don't seem relevant, we should learn a signal from the fact that data is missing, and the resulting model should have a non-zero value in the missing bin. In cases where the missing values were caused by a more random process, like flipping a coin, you would expect the value in the missing bin to be closer to zero. This will be more likely to hold, though, once we implement the XGBoost method of training, as it should be more balanced.

I would expect better missing value handling to take longer than a few weeks to make it into the package, in part because it's a new moving part and we want to give the new representation some time to solidify throughout our codebase before relying on it more. We also need to tackle getting missing values into our visualization system, so that they can be seen by our users.

-InterpretML team

candalfigomoro commented

@interpret-ml
Is there any news regarding support for missing values? Thanks

xiaohk (Contributor) commented Oct 6, 2021

Hello @stefanhgm, @timvink, @p9anand, @antonkulaga, @candalfigomoro! Thank you so much for using EBM! I am Jay Wang, a research intern on the InterpretML team. We are developing a new visualization tool for EBM and recruiting participants for a user study (see #283 for more details).

We think you are a good fit for this paid user study! If you are interested, you can sign up with the link in #283. Let me know if you have any questions. Thank you!

BTW, I really enjoyed reading your paper "An Evaluation of the Doctor-Interpretability of Generalized Additive Models with Interactions", @stefanhgm!

stefanhgm commented Oct 11, 2021

Hi @xiaohk,

Thank you very much! Btw, we also released the code for the visualization used in our experiments. It is based on Java Spring with a JS frontend and includes treatment of missing values: https://github.com/stefanhgm/EBM-Java-UI. We will also publish another paper very soon in which we use EBMs to predict ICU readmission; it includes inspection and model editing by a team of doctors.

Your study sounds very cool; I just signed up. Thanks for the hint!

xiaohk (Contributor) commented Oct 11, 2021

@stefanhgm Thanks! Your visualization tool looks very cool! I will send you an email with more details about the study.

candalfigomoro commented

@interpret-ml
Missing values are pretty common in real-world business data. Is EBM able to handle them without having to impute them (the way LightGBM does)? Imputation may be meaningless in some cases.

paulbkoch (Collaborator) commented

Missing values are now exposed in the latest v0.3.0 release without the need to modify the code. We still don't have UI to show them, but we instead emit a warning that indicates this and explains how to access the missing value scores programmatically. Leaving the issue open until we have UI to show them.

paulbkoch mentioned this issue on Jan 21, 2023