
feature_names mismatch when using xgboost + sklearn (XGBClassifier) + eli5(explain_prediction) #2334

Closed
robinmohseni opened this issue May 22, 2017 · 22 comments

@robinmohseni

Environment info

Operating System: Windows

Package used (python): pandas df, xgboost, sklearn (XGBClassifier) and ELI5 http://eli5.readthedocs.io/en/latest/libraries/xgboost.html

xgboost version used: 0.6

Hi

I have a trained XGBClassifier in python, and I am trying to call the explain_prediction or show_prediction functions in the ELI5 package. This function allows you to interpret the prediction for a given set of features.

The classifier was trained on a pandas DataFrame, and the predict (sklearn) and show_weights (ELI5) functions work perfectly. I can also interrogate the trees using booster.get_dump() with no issues. However, calling show_prediction raises the feature_names mismatch error shown below.

Does anyone have any ideas on how to resolve this?

This may be related to #1441 but I am generating predictions with no error.

Thank you

R

Code below:

# train is the training data frame (pandas)
# best_gbm.gbm is an XGBClassifier trained on the train dataset
# I have also passed feature_names=predictors, but this throws the same error
show_prediction(best_gbm.gbm, train.iloc[1])

Traceback (most recent call last):

File "", line 1, in
show_prediction(best_gbm.gbm, train.iloc[1])

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\eli5-0.4.2-py3.6.egg\eli5\ipython.py", line 242, in show_prediction
expl = explain_prediction(estimator, doc, **explain_kwargs)

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\Lib\site-packages\singledispatch.py", line 210, in wrapper
return dispatch(args[0].__class__)(*args, **kw)

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\Lib\site-packages\singledispatch.py", line 210, in wrapper
return dispatch(args[0].__class__)(*args, **kw)

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\eli5-0.4.2-py3.6.egg\eli5\xgboost.py", line 168, in explain_prediction_xgboost
proba = predict_proba(xgb, X)

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\eli5-0.4.2-py3.6.egg\eli5\sklearn\utils.py", line 49, in predict_proba
proba, = clf.predict_proba(X)

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\xgboost-0.6-py3.6.egg\xgboost\sklearn.py", line 496, in predict_proba
ntree_limit=ntree_limit)

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\xgboost-0.6-py3.6.egg\xgboost\core.py", line 950, in predict

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\xgboost-0.6-py3.6.egg\xgboost\core.py", line 1193, in _validate_features
raise ValueError(msg.format(self.feature_names,

ValueError: feature_names mismatch:

sensitive info removed
training data did not have the following fields: f73, f40, f66, f147, f62, f39, f2, f83, f127, f84, f54, f97, f114, f102, f49, f7, f8, f56, f23, f107, f138, f28, f71, f152, f80, f57, f46, f58, f139, f121, f140, f20, f45, f113, f5, f60, f135, f101, f68, f76, f65, f41, f99, f131, f109, f117, f13, f100, f128, f52, f15, f50, f95, f124, f19, f12, f43, f137, f33, f22, f32, f72, f142, f151, f74, f90, f48, f122, f133, f26, f79, f94, f18, f10, f51, f0, f53, f92, f29, f115, f143, f14, f116, f47, f69, f82, f34, f89, f35, f6, f132, f16, f118, f31, f96, f59, f75, f1, f110, f61, f108, f25, f21, f11, f17, f85, f150, f3, f98, f24, f77, f103, f112, f91, f144, f70, f86, f119, f55, f130, f106, f44, f36, f64, f67, f4, f145, f37, f126, f88, f93, f104, f81, f149, f27, f136, f146, f30, f38, f42, f141, f134, f120, f105, f129, f9, f148, f87, f125, f123, f111, f78, f63

@fxc123

fxc123 commented May 23, 2017

I got the same error. Is there any solution yet?

@bpben

bpben commented May 26, 2017

Same error as well. Mine only occurs when I try to use CalibratedClassifierCV from sklearn with the prefit option (i.e. passing it a prefit XGBClassifier object). If I pass it without fitting the classifier, it works fine. I can't figure out what exactly is going on, but it seems like something with the translation from DataFrame to DMatrix and back.

@swoopyy

swoopyy commented Jun 11, 2017

I had the same error when the columns were inserted into the test set in a different order than they were inserted into the train set. Changing the insertion order to match solved the problem.
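For reference, a minimal sketch of that fix with hypothetical DataFrame names, assuming both sets contain the same columns and only the order differs:

X_test = X_test[X_train.columns]  # reorder the test columns to match the training order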

@Mayanksoni20

This error comes up because the training matrix has features that the prediction matrix is missing.

You have to create the features again on the combined train+test data set (just to create the features, not to train the model) and then run the prediction, as sketched below.
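For illustration, a minimal sketch of that approach, assuming pandas DataFrames named train and test (hypothetical names): build the features on the combined data so both frames end up with identical columns, then split back before fitting.

import pandas as pd

combined = pd.concat([train, test], keys=['train', 'test'])
combined = pd.get_dummies(combined)   # feature creation on train+test together
train_feats = combined.xs('train')    # split back; both frames now share identical columns
test_feats = combined.xs('test')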

@ChelyYi

ChelyYi commented Sep 21, 2017

I have the same problem. Has anyone solved this?
It's the first time I've met this error, and I'm sure it isn't caused by an incorrect feature number or order.
When I use other models in sklearn, I don't get this error, so I wonder whether it is caused by xgboost itself.

@Mayanksoni20

Mayanksoni20 commented Sep 21, 2017 via email

@justinrgarrard

justinrgarrard commented Sep 29, 2017

Hey, y'all. I spent most of the day fiddling around with a similar problem and I'd like to offer my two bits. There's more than one root cause for this error. As such, there's more than one fix depending on the origins of your problem.

  • Your columns are out of alignment. This one can be fixed by doing something like what Mayanksoni20 suggests. Here's one example, taken from another issue thread:

test = test[train.columns]

  • Your data has NA's or other objects in it. XGB and Pandas don't like each other much. The DMatrix object that DataFrames get converted into can choke on a lot of things. One way to handle this is by scrubbing the data down, using this method (courtesy of Rocketq on StackOverflow):
from sklearn import preprocessing
import numpy as np

# encode object (string) columns as integers
for f in train.columns:
    if train[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train[f].values))
        train[f] = lbl.transform(list(train[f].values))

for f in test.columns:
    if test[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(test[f].values))
        test[f] = lbl.transform(list(test[f].values))

# replace missing values with a sentinel
train.fillna(-999, inplace=True)
test.fillna(-999, inplace=True)

# hand XGBoost plain float arrays instead of DataFrames
train = np.array(train).astype(float)
test = np.array(test).astype(float)
  • Your prediction data has zeros in it. This swings back to the sparse matrix implementation used for DMatrix. If the tail values of your prediction row are zeros, they'll be cut off (leaving you with various missing columns like f10, f11, and so on). This was my problem. Luckily, the fix isn't too hard. Just replace the last element with a very small number, like so (an alternative is sketched after this comment):
x = [features_test[i]]        # single prediction row as a nested list
if x[0][-1] == 0:
    x[0][-1] = 0.0000001      # keep the last column from being dropped as "missing"
pred = int(xgb_regressor.predict(x))

Hope that helps.
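As a hedged alternative to the small-number replacement in the last case above (same hypothetical names), passing the row as a dense 2-D float array should avoid any trailing-zero truncation when the DMatrix is built:

import numpy as np

x = np.asarray(features_test[i], dtype=float).reshape(1, -1)  # dense 2-D row, zeros preserved
pred = int(xgb_regressor.predict(x))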

@Mayanksoni20

Mayanksoni20 commented Oct 3, 2017

This can also be sorted by replacing "vectorizer.fit_transform()" with "vectorizer.transform()" when building the TF-IDF matrix for the prediction data.

That way the matrix keeps the same feature space as training.
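A minimal sketch of that with scikit-learn's TfidfVectorizer (hypothetical corpus names): fit on the training text only, then only transform the prediction text so both matrices share the same vocabulary and column order.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary on the training text
X_test = vectorizer.transform(test_texts)        # reuses it, so the feature space matches training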

@g0lemXIV

Changing the column names also worked for me. I have a prediction model with K-fold validation 👍

@nishkalavallabhi

I also have this issue, but only when I do a train-test split. If I do cross-validation, it runs without errors.

@MaxPowerWasTaken

@justinrgarrard do you have a reproducible example for your last case? The first couple of cases you mention (passing columns out of alignment, passing object dtype) seem like user error; it's on us as users to pass xgb.fit and xgb.predict a consistently ordered and labeled train/test set with valid dtypes.

But if calling predict on a dataset with the identical order and column names as the dataset .fit() was run on throws an error because of zeroes in one of the columns of the test/predict set, that to me would be an xgb bug.

@MaxPowerWasTaken

^ Looks like @justinrgarrard is correct; this seems to be a known issue with at least certain versions of xgboost. From these two issue tickets, it seems upgrading to the latest version of xgboost solves it in some but possibly not all cases; there are a few references to hack-fixes like Justin's and others.
#1091
#1238

@justinrgarrard

@MaxPowerWasTaken glad to hear that progress has been made on that front. Might be worth keeping an ear out for any future mentions of similar circumstances.

@Bingohong

You can convert the DataFrames to numpy arrays with .as_matrix():
x_test = x_test.as_matrix()
x_train = x_train.as_matrix()
Then refit the model and it works.
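Note that .as_matrix() has since been deprecated in pandas; a minimal sketch of the same idea using .values (hypothetical names):

x_train = x_train.values  # or x_train.to_numpy() on newer pandas
x_test = x_test.values
model.fit(x_train, y_train)  # refit on plain arrays so no pandas column names are stored as feature names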

@yxjsxy

yxjsxy commented Apr 17, 2018

Changing it into a numpy array solved this issue for me. I think it may be because the original train matrix was a numpy array but the prediction input is now a pandas DataFrame, which causes the mismatch.

@natskr

natskr commented Apr 30, 2018

I had a very similar issue with ELI5 and xgboost, but my column order between train and test is identical and I have no NULL or NaN values in my dataset. I solved the issue by following @Bingohong's advice; please see my code below:

import eli5
from eli5.sklearn import PermutationImportance
from xgboost import XGBClassifier

X_matrix = X.as_matrix()
y_matrix = y.as_matrix()

xgb = XGBClassifier()
xgb.fit(X_matrix, y_matrix)
perm_xgb = PermutationImportance(xgb).fit(X_matrix, y_matrix)
eli5.show_weights(perm_xgb, feature_names=list(X))

This may also be related to: TeamHG-Memex/eli5#256

@pdesainteagathe

Same issue with the XGBRegressor sklearn API when trying to predict for a single row of the test dataset. justinrgarrard's advice did not work for me.

@iwangzhengchao

Use the .values attribute to convert the pandas DataFrame to an ndarray.

@GDBSD

GDBSD commented Jun 11, 2018

Check the exception. What you should see are two arrays: one is the column names of the DataFrame you're passing in, and the other is the XGBoost feature names. They should be the same length. If you put them side by side in an Excel spreadsheet you will see that they are not in the same order. My guess is that the XGBoost names were written to a dictionary, so it would be a coincidence if the names in the two arrays were in the same order. The fix is easy: just reorder your DataFrame columns to match the XGBoost names:

f_names = model.feature_names
df = df[f_names]

@tqchen closed this as completed Jul 4, 2018
@Ewande

Ewande commented Jul 19, 2018

I had the same error with pandas.DataFrame, XGBClassifier and CalibratedClassifierCV (with prefit option).

A short explanation of what's going on in this case:

  • we train XGBClassifier using data in a pandas.DataFrame (X_train), so the Booster object inside XGBClassifier saves the pandas column names as feature names (e.g. ['a', 'b', 'c'])
  • having XGBClassifier trained, we want to calibrate it, so we run CalibratedClassifierCV(model, cv='prefit').fit(X_val, y_val) (as X_train was a pandas.DataFrame, so is X_val)
  • inside the fit function, X_val is converted to a numpy.ndarray, which is then passed to the predict_proba method of XGBClassifier
  • inside XGBClassifier.predict_proba, the input is converted to a DMatrix before running the prediction process
  • if DMatrix gets a pandas.DataFrame, it sets feature_names to the pandas column names (as in the training process), but in this case it gets a numpy.ndarray, so feature_names is set to [f0, f1, f2, ...]
  • feature_names in the prediction input is compared with feature_names of the trained Booster object and we get a mismatch

Why converting to numpy.ndarray helps:
Converting X_train into a numpy.ndarray makes XGBClassifier save [f0, f1, f2, ...] as feature names (instead of ['a', 'b', 'c']), and then there is no mismatch during fitting (or during prediction) with CalibratedClassifierCV.

Summing up:
I think it's not a bug in xgboost, but rather an incompatibility between sklearn's CalibratedClassifierCV and pandas.DataFrame.
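A minimal sketch of the workaround described above (hypothetical names): train on plain ndarrays so the Booster and CalibratedClassifierCV agree on feature names.

from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV

model = XGBClassifier()
model.fit(X_train.values, y_train)                       # fit on an ndarray, not a DataFrame
calibrated = CalibratedClassifierCV(model, cv='prefit')
calibrated.fit(X_val.values, y_val)                      # same representation at calibration time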

@kylecampbell

kylecampbell commented Aug 21, 2018

If the model was trained using DataFrames, then the following may help.

When predicting on the entire DataFrame, this works:
model.predict(df[predictors])

When making a single prediction from the DataFrame,
model.predict(df[predictors].iloc[-1])
returns feature_names mismatch errors like those mentioned above (missing f0, f1, ...), while this works:
model.predict(df[predictors].iloc[[-1]])
Note the iloc row is in double brackets to keep it a DataFrame.
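In other words (a small sketch with hypothetical names), the difference is a 1-D Series versus a one-row DataFrame:

row_series = df[predictors].iloc[-1]    # Series: the column names get lost on the way to the DMatrix
row_frame = df[predictors].iloc[[-1]]   # one-row DataFrame: keeps the column names
pred = model.predict(row_frame)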

@orasis

orasis commented Nov 2, 2018

I'm experiencing this with feature hashing. The columns are in perfect alignment, but the test set has one fewer feature at the last feature position than the training set, because the feature hasher on the training set never saw this feature. Frankly this is going to be challenging to fix because our training and test pipelines are separate and we're using SVM files since the data set is so large.

It would be nice if there were an option to ignore this error.
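One possible (hedged) workaround, assuming the hashed test features end up in a scipy sparse matrix: pad the test matrix with empty columns so it has the same width as the training matrix before prediction.

import scipy.sparse as sp

n_missing = X_train.shape[1] - X_test.shape[1]
if n_missing > 0:
    pad = sp.csr_matrix((X_test.shape[0], n_missing))    # all-zero columns for features unseen at test time
    X_test = sp.hstack([X_test, pad], format='csr')      # now the same column count as training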

lock bot locked as resolved and limited conversation to collaborators Jan 31, 2019