
feature_names mismatch when using xgboost + sklearn (XGBClassifier) + eli5(explain_prediction) #2334

Closed
robinmohseni opened this issue May 22, 2017 · 22 comments

@robinmohseni

Environment info

Operating System: Windows

Package used (python): pandas df, xgboost, sklearn (XGBClassifier) and ELI5 http://eli5.readthedocs.io/en/latest/libraries/xgboost.html

xgboost version used: 0.6

Hi

I have a trained XGBClassifier in python, and I am trying to call the explain_prediction or show_prediction functions in the ELI5 package. This function allows you to interpret the prediction for a given set of features.

The classifier was trained on a pandas DataFrame, and the predict (sklearn) and show_weights (ELI5) functions work perfectly. I can also interrogate the trees using booster.get_dump() with no issues. However, calling show_prediction raises the feature_names mismatch error shown below.

Does anyone have any ideas on how to resolve this?

This may be related to #1441 but I am generating predictions with no error.

Thank you

R

Code below:

# train is the training data frame (pandas)
# best_gbm.gbm is an XGBClassifier trained on the train dataset
# I have also passed feature_names=predictors, but this throws the same error
show_prediction(best_gbm.gbm, train.iloc[1])

Traceback (most recent call last):

File "", line 1, in
show_prediction(best_gbm.gbm, train.iloc[1])

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\eli5-0.4.2-py3.6.egg\eli5\ipython.py", line 242, in show_prediction
expl = explain_prediction(estimator, doc, **explain_kwargs)

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\Lib\site-packages\singledispatch.py", line 210, in wrapper
return dispatch(args[0].__class__)(*args, **kw)

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\Lib\site-packages\singledispatch.py", line 210, in wrapper
return dispatch(args[0].__class__)(*args, **kw)

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\eli5-0.4.2-py3.6.egg\eli5\xgboost.py", line 168, in explain_prediction_xgboost
proba = predict_proba(xgb, X)

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\eli5-0.4.2-py3.6.egg\eli5\sklearn\utils.py", line 49, in predict_proba
proba, = clf.predict_proba(X)

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\xgboost-0.6-py3.6.egg\xgboost\sklearn.py", line 496, in predict_proba
ntree_limit=ntree_limit)

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\xgboost-0.6-py3.6.egg\xgboost\core.py", line 950, in predict

File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\xgboost-0.6-py3.6.egg\xgboost\core.py", line 1193, in _validate_features
raise ValueError(msg.format(self.feature_names,

ValueError: feature_names mismatch:

sensitive info removed
training data did not have the following fields: f73, f40, f66, f147, f62, f39, f2, f83, f127, f84, f54, f97, f114, f102, f49, f7, f8, f56, f23, f107, f138, f28, f71, f152, f80, f57, f46, f58, f139, f121, f140, f20, f45, f113, f5, f60, f135, f101, f68, f76, f65, f41, f99, f131, f109, f117, f13, f100, f128, f52, f15, f50, f95, f124, f19, f12, f43, f137, f33, f22, f32, f72, f142, f151, f74, f90, f48, f122, f133, f26, f79, f94, f18, f10, f51, f0, f53, f92, f29, f115, f143, f14, f116, f47, f69, f82, f34, f89, f35, f6, f132, f16, f118, f31, f96, f59, f75, f1, f110, f61, f108, f25, f21, f11, f17, f85, f150, f3, f98, f24, f77, f103, f112, f91, f144, f70, f86, f119, f55, f130, f106, f44, f36, f64, f67, f4, f145, f37, f126, f88, f93, f104, f81, f149, f27, f136, f146, f30, f38, f42, f141, f134, f120, f105, f129, f9, f148, f87, f125, f123, f111, f78, f63

@fxc123

fxc123 commented May 23, 2017

I got the same error. Is there any solution yet?

@bpben

bpben commented May 26, 2017

Same error as well. Mine only occurs when I try to use CalibratedClassifierCV from sklearn with the prefit option (i.e. passing it a prefit XGBClassifier object). If I pass it without fitting the classifier, it works fine. I can't figure out what exactly is going on, but it seems like something with the translation from DataFrame to DMatrix and back.

@swoopyy

swoopyy commented Jun 11, 2017

I had the same error when the columns were inserted into the test set in a different order than they were inserted into the train set. Changing the insertion order to match solved the problem.
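For reference, a minimal sketch of that fix with hypothetical DataFrame names, assuming both sets contain the same columns and only the order differs:

X_test = X_test[X_train.columns]  # reorder the test columns to match the training order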

@Mayanksoni20

This error comes up because the training matrix has features that the prediction matrix is missing.

You have to create the features again on the combined train+test data set (just to create the features, not to train the model) and then run the prediction, as sketched below.
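For illustration, a minimal sketch of that approach, assuming pandas DataFrames named train and test (hypothetical names): build the features on the combined data so both frames end up with identical columns, then split back before fitting.

import pandas as pd

combined = pd.concat([train, test], keys=['train', 'test'])
combined = pd.get_dummies(combined)   # feature creation on train+test together
train_feats = combined.xs('train')    # split back; both frames now share identical columns
test_feats = combined.xs('test')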

@ChelyYi

ChelyYi commented Sep 21, 2017

I have the same problem. Has anyone solved this?
It's the first time I've met this error, and I'm sure it isn't caused by an incorrect feature number or order.
When I use other models in sklearn, I don't get this error, so I wonder whether it is caused by xgboost itself.

@Mayanksoni20

Mayanksoni20 commented Sep 21, 2017 via email

@justinrgarrard

justinrgarrard commented Sep 29, 2017

Hey, y'all. I spent most of the day fiddling around with a similar problem and I'd like to offer my two bits. There's more than one root cause for this error. As such, there's more than one fix depending on the origins of your problem.

  • Your columns are out of alignment. This one can be fixed by doing something like what Mayanksoni20 suggests. Here's one example, taken from another issue thread:

test = test[train.columns]

  • Your data has NA's or other objects in it. XGB and Pandas don't like each other much. The DMatrix object that DataFrames get converted into can choke on a lot of things. One way to handle this is by scrubbing the data down, using this method (courtesy of Rocketq on StackOverflow):
from sklearn import preprocessing
import numpy as np

# encode object (string) columns as integers
for f in train.columns:
    if train[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train[f].values))
        train[f] = lbl.transform(list(train[f].values))

for f in test.columns:
    if test[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(test[f].values))
        test[f] = lbl.transform(list(test[f].values))

# replace missing values with a sentinel
train.fillna(-999, inplace=True)
test.fillna(-999, inplace=True)

# hand XGBoost plain float arrays instead of DataFrames
train = np.array(train).astype(float)
test = np.array(test).astype(float)
  • Your prediction data has zeros in it. This swings back to the sparse matrix implementation used for DMatrix. If the tail values of your prediction row are zeros, they'll be cut off (leaving you with various missing columns like f10, f11, and so on). This was my problem. Luckily, the fix isn't too hard. Just replace the last element with a very small number, like so (an alternative is sketched after this comment):
x = [features_test[i]]        # single prediction row as a nested list
if x[0][-1] == 0:
    x[0][-1] = 0.0000001      # keep the last column from being dropped as "missing"
pred = int(xgb_regressor.predict(x))

Hope that helps.
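As a hedged alternative to the small-number replacement in the last case above (same hypothetical names), passing the row as a dense 2-D float array should avoid any trailing-zero truncation when the DMatrix is built:

import numpy as np

x = np.asarray(features_test[i], dtype=float).reshape(1, -1)  # dense 2-D row, zeros preserved
pred = int(xgb_regressor.predict(x))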

@Mayanksoni20

Mayanksoni20 commented Oct 3, 2017

This can also be sorted by replacing "vectorizer.fit_transform()" with "vectorizer.transform()" when building the TF-IDF matrix for the prediction data.

That way the matrix keeps the same feature space as training.
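A minimal sketch of that with scikit-learn's TfidfVectorizer (hypothetical corpus names): fit on the training text only, then only transform the prediction text so both matrices share the same vocabulary and column order.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary on the training text
X_test = vectorizer.transform(test_texts)        # reuses it, so the feature space matches training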

@g0lemXIV

Changing the column names also worked for me. I have a prediction model with K-fold validation 👍

@nishkalavallabhi

I also have this issue, but only when I do a train-test split. If I do cross-validation, it runs without errors.

@MaxPowerWasTaken

@justinrgarrard do you have a reproducible example for your last case? The first couple of cases you mention (passing columns out of alignment, passing object dtype) seem like user error; it's on us as users to pass xgb.fit and xgb.predict a consistently ordered and labeled train/test set with valid dtypes.

But if calling predict on a dataset with the identical order and column names as the dataset .fit() was run on throws an error because of zeroes in one of the columns of the test/predict set, that to me would be an xgb bug.

@MaxPowerWasTaken

^ Looks like @justinrgarrard is correct; this seems to be a known issue with at least certain versions of xgboost. From these two issue tickets, it seems upgrading to the latest version of xgboost solves it in some but possibly not all cases; there are a few references to hack-fixes like Justin's and others.
#1091
#1238

@justinrgarrard

@MaxPowerWasTaken glad to hear that progress has been made on that front. Might be worth keeping an ear out for any future mentions of similar circumstances.

@Bingohong

You can convert the DataFrames to numpy arrays with .as_matrix():
x_test = x_test.as_matrix()
x_train = x_train.as_matrix()
Then refit the model and it works.
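Note that .as_matrix() has since been deprecated in pandas; a minimal sketch of the same idea using .values (hypothetical names):

x_train = x_train.values  # or x_train.to_numpy() on newer pandas
x_test = x_test.values
model.fit(x_train, y_train)  # refit on plain arrays so no pandas column names are stored as feature names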

@yxjsxy

yxjsxy commented Apr 17, 2018

Changing it into a numpy array solved this issue for me. I think it may be because the original train matrix was a numpy array but the prediction input is now a pandas DataFrame, which causes the mismatch.

@natskr

natskr commented Apr 30, 2018

I had a very similar issue with ELI5 and xgboost, but my column order between train and test is identical and I have no NULL or NaN values in my dataset. I solved the issue by following @Bingohong's advice; please see my code below:

import eli5
from eli5.sklearn import PermutationImportance
from xgboost import XGBClassifier

X_matrix = X.as_matrix()
y_matrix = y.as_matrix()

xgb = XGBClassifier()
xgb.fit(X_matrix, y_matrix)
perm_xgb = PermutationImportance(xgb).fit(X_matrix, y_matrix)
eli5.show_weights(perm_xgb, feature_names=list(X))

This may also be related to: TeamHG-Memex/eli5#256

@pdesainteagathe

Same issue with the XGBRegressor sklearn API when trying to predict for a single row of the test dataset. justinrgarrard's advice did not work for me.

@iwangzhengchao

Use the .values attribute to convert the pandas DataFrame to an ndarray.

@GDBSD

GDBSD commented Jun 11, 2018

Check the exception. What you should see are two arrays: one is the column names of the DataFrame you're passing in, and the other is the XGBoost feature names. They should be the same length. If you put them side by side in an Excel spreadsheet you will see that they are not in the same order. My guess is that the XGBoost names were written to a dictionary, so it would be a coincidence if the names in the two arrays were in the same order. The fix is easy: just reorder your DataFrame columns to match the XGBoost names:

f_names = model.feature_names
df = df[f_names]

@tqchen closed this as completed Jul 4, 2018
@Ewande

Ewande commented Jul 19, 2018

I had the same error with pandas.DataFrame, XGBClassifier and CalibratedClassifierCV (with prefit option).

A short explanation of what's going on in this case:

  • we train XGBClassifier using data in a pandas.DataFrame (X_train), so the Booster object inside XGBClassifier saves the pandas column names as feature names (e.g. ['a', 'b', 'c'])
  • having XGBClassifier trained, we want to calibrate it, so we run CalibratedClassifierCV(model, cv='prefit').fit(X_val, y_val) (as X_train was a pandas.DataFrame, so is X_val)
  • inside the fit function, X_val is converted to a numpy.ndarray, which is then passed to the predict_proba method of XGBClassifier
  • inside XGBClassifier.predict_proba, the input is converted to a DMatrix before running the prediction process
  • if DMatrix gets a pandas.DataFrame, it sets feature_names to the pandas column names (as in the training process), but in this case it gets a numpy.ndarray, so feature_names is set to [f0, f1, f2, ...]
  • feature_names in the prediction input is compared with feature_names of the trained Booster object and we get a mismatch

Why converting to numpy.ndarray helps:
Converting X_train into a numpy.ndarray makes XGBClassifier save [f0, f1, f2, ...] as feature names (instead of ['a', 'b', 'c']), and then there is no mismatch during fitting (or during prediction) with CalibratedClassifierCV.

Summing up:
I think it's not a bug in xgboost, but rather an incompatibility between sklearn's CalibratedClassifierCV and pandas.DataFrame.
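A minimal sketch of the workaround described above (hypothetical names): train on plain ndarrays so the Booster and CalibratedClassifierCV agree on feature names.

from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV

model = XGBClassifier()
model.fit(X_train.values, y_train)                       # fit on an ndarray, not a DataFrame
calibrated = CalibratedClassifierCV(model, cv='prefit')
calibrated.fit(X_val.values, y_val)                      # same representation at calibration time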

@kylecampbell

kylecampbell commented Aug 21, 2018

If the model was trained using DataFrames, then the following may help.

When predicting on the entire DataFrame, this works:
model.predict(df[predictors])

When making a single prediction from the DataFrame,
model.predict(df[predictors].iloc[-1])
returns feature_names mismatch errors like those mentioned above (missing f0, f1, ...), while this works:
model.predict(df[predictors].iloc[[-1]])
Note the iloc row is in double brackets to keep it a DataFrame.
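In other words (a small sketch with hypothetical names), the difference is a 1-D Series versus a one-row DataFrame:

row_series = df[predictors].iloc[-1]    # Series: the column names get lost on the way to the DMatrix
row_frame = df[predictors].iloc[[-1]]   # one-row DataFrame: keeps the column names
pred = model.predict(row_frame)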

@orasis

orasis commented Nov 2, 2018

I'm experiencing this with feature hashing. The columns are in perfect alignment, but the test set has one fewer feature at the last feature position than the training set, because the feature hasher on the training set never saw this feature. Frankly this is going to be challenging to fix because our training and test pipelines are separate and we're using SVM files since the data set is so large.

It would be nice if there were an option to ignore this error.
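One possible (hedged) workaround, assuming the hashed test features end up in a scipy sparse matrix: pad the test matrix with empty columns so it has the same width as the training matrix before prediction.

import scipy.sparse as sp

n_missing = X_train.shape[1] - X_test.shape[1]
if n_missing > 0:
    pad = sp.csr_matrix((X_test.shape[0], n_missing))    # all-zero columns for features unseen at test time
    X_test = sp.hstack([X_test, pad], format='csr')      # now the same column count as training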

lock bot locked as resolved and limited conversation to collaborators Jan 31, 2019