Conflict with PermutationImportance, DataFrame and XGBoost (with workaround) #256
In addition I confirm that I have the same issue. In this case I have a classifier.
I've written some test code so we can easily reproduce this bug (at PyDataLondon 2018 Conference):
This code makes a 15-element X and y with XGBClassifier and turns them into a DataFrame. When we run this with the DataFrame, PermutationImportance throws the feature-name error. If we use the ndarray version it works; if we use the original DataFrame the error reappears.
The above was posted at https://pydata.org/london2018/ whilst I ran a "Make your first open source contribution" session. Possibly one of my attendees will recreate this bug to confirm that they see the same issue. I'm also hoping that someone has a crack at making two tests:
I recreated the error on my machine (Ubuntu 16.04): with a pandas DataFrame I get the same feature-name error, and everything works fine with an ndarray.
Also recreated on macOS 10.12.6 outside of IPython, with pandas==0.22.0.
A very inconvenient issue; I really hope someone gets time to look into this!
By way of an update, I still see this bug using a fresh conda environment. The workaround of swapping the DataFrame for an ndarray still works, although it now produces deprecation warnings; those warnings look very similar to another reported bug.
Can confirm this issue is still active.
Created a fix PR for this issue here. Also, in the meantime there is a better workaround that doesn't require retraining a model: simply pass the scorer below into PermutationImportance at init:

```python
def scorer(model, X, y):
    return model.score(pd.DataFrame(X, columns=X_train.columns), y)

perm = PermutationImportance(est, scoring=scorer).fit(X_test, y_test)
```
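To see why re-attaching the training column names fixes the scoring step, here is a self-contained illustration with a stand-in model (`DummyModel` is hypothetical; in the thread the real estimator is an XGBoost model trained on `X_train`):

```python
import pandas as pd

class DummyModel:
    """Stand-in estimator that, like XGBoost, insists on named features."""
    def __init__(self, feature_names):
        self.feature_names = list(feature_names)

    def score(self, X, y):
        # Mimics XGBoost's feature-name validation at predict/score time.
        if not isinstance(X, pd.DataFrame) or list(X.columns) != self.feature_names:
            raise ValueError("feature_names mismatch")
        return 1.0  # dummy score

X_train = pd.DataFrame({"f0": [0.1, 0.2], "f1": [1.0, 2.0]})
model = DummyModel(X_train.columns)

def scorer(model, X, y):
    # Re-attach the training column names before scoring.
    return model.score(pd.DataFrame(X, columns=X_train.columns), y)

# PermutationImportance hands the model a bare ndarray during permutation,
# so direct scoring fails while the wrapped scorer succeeds.
X_arr = X_train.values
try:
    model.score(X_arr, [0, 1])
except ValueError as e:
    print("direct scoring failed:", e)

print(scorer(model, X_arr, [0, 1]))  # 1.0
```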
Since #296 is merged, I think this can be closed, thanks @hofesh for the fix, @ianozsvald for reporting and diagnosing, and to everyone who contributed. If there is anything not fixed yet, please comment and we'll reopen (or open a new one).
I'm working on a regression problem in insurance. XGB outperforms sklearn's tree ensembles. ELI5's `show_weights` works fine if I use `XGBRegressor` or `RandomForestRegressor`; if I use `PermutationImportance` then `RandomForestRegressor` works but `XGBRegressor` throws an error. I have a workaround noted below, I'm not sure where the problem lies. Any guidance would be happily received.

Estimator:

If I train `est` on the `DataFrame` version of the data, then training, testing and ELI5's `show_weights` are fine, but `PermutationImportance` throws a feature-name error.

If I change the type of the training data to an `ndarray` then `PermutationImportance` no longer throws an error.

I'll note that with sklearn's `RandomForestRegressor` I can use pandas `DataFrame`s in `PermutationImportance` without problem; it is only an issue with XGBoost.

My features are two continuous values and one binary indicator, I have no categorical features. This is a smaller set of features from a larger problem (where the problem originated).

I got the clue to try `as_matrix` in this XGBoost bug report on sparse matrices - this might be a red herring as I'm not using any sparse matrices at all: dmlc/xgboost#1238 (comment)

Possibly related - this ELI5 bug report on Pandas and XGBoost has linked code, and that code has a similar error (`In[79]` with `show_predictions`) to my error: #166 https://github.com/nareshshah139/titanic_rebranded/blob/master/ELI5_Example.ipynb but potentially, as @kmike noted, `get_dummies` is called separately on the train and validation set, so maybe they don't have matching feature names in this example?
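The feature-name mismatch described above can be seen with pandas alone: converting a DataFrame to an ndarray (as the permutation step effectively does) discards the column names that an estimator trained on a DataFrame may have recorded at fit time. A minimal sketch (the column names here are invented):

```python
import pandas as pd

X = pd.DataFrame({"age": [30, 40], "income": [50.0, 60.0], "flag": [0, 1]})
fit_time_names = list(X.columns)     # what the model remembers from fit()
arr = X.values                       # what the permutation step passes around
predict_time = pd.DataFrame(arr)     # default integer columns, names are gone

print(fit_time_names)                # ['age', 'income', 'flag']
print(list(predict_time.columns))    # [0, 1, 2]

# The workarounds above follow from this: either train on X.values so no
# names are recorded, or rebuild the frame with the original columns
# before scoring.
restored = pd.DataFrame(arr, columns=fit_time_names)
assert list(restored.columns) == fit_time_names
```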