feature_names mismatch when using xgboost + sklearn (XGBClassifier) + eli5(explain_prediction) #2334
Comments
I got the same error, is there any solution now? |
Same error as well. Mine only occurs when I try to use CalibratedClassifierCV from sklearn with the prefit option (i.e. passing it a prefit XGBClassifier object). If I pass it without fitting the classifier, it works fine. I can't figure out what exactly is going on, but it seems like something with the translation from DataFrame to DMatrix and back. |
I had the same error when I inserted columns into the test set in a different order than I had inserted them into the train set. Inserting them in the same order solved the problem. |
This error comes up because the training set has more features in its matrix than the test set. You have to create the features again from the combined train+test dataset (just to create the features, not to train the model) and then run the prediction. |
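One way to get the same effect without concatenating train and test is to reindex the test matrix onto the training columns. The sketch below is only an illustration: the toy data and column names are made up, and it assumes the features come from `pandas.get_dummies`.

```python
import pandas as pd

# Hypothetical toy data: the category "red" never appears in the test set.
train = pd.DataFrame({"size": [1, 2, 3], "color": ["red", "blue", "green"]})
test = pd.DataFrame({"size": [4, 5], "color": ["blue", "green"]})

X_train = pd.get_dummies(train, columns=["color"])
X_test = pd.get_dummies(test, columns=["color"])  # has no "color_red" column

# Reindex the test matrix onto the training columns: missing columns are
# created and filled with 0, extra columns are dropped, order is matched.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(list(X_test.columns))
```

After the reindex, both matrices have identical column names in identical order, which is what the feature-name check compares.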
I have the same problem. Has anyone solved this? |
This problem occurs for the following reason: the test and train datasets must have the same columns in the same order.
I solved it by emptying the train matrix and filling it with data from the test set by matching columns.
…On Thu, Sep 21, 2017 at 1:50 PM, ChelyYi ***@***.***> wrote:
I have the same problem. Has anyone solved this?
It's the first time I've hit this error, and I'm sure it isn't caused by an incorrect feature count or order.
When I use other models in sklearn, I don't get this error, so I wonder whether it is caused by xgboost itself?
|
Hey, y'all. I spent most of the day fiddling around with a similar problem and I'd like to offer my two bits. There's more than one root cause for this error. As such, there's more than one fix depending on the origins of your problem.
Hope that helps. |
This can also be fixed by replacing "vectorizer.fit_transform()" with "vectorizer.transform()" when building the TF-IDF matrix for the test set. The already-fitted vectorizer then produces a matrix with the same columns as the training matrix. |
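A minimal sketch of that point, assuming scikit-learn's `TfidfVectorizer` and made-up documents: fitting on the training text only and calling `transform` on the test text keeps the column count fixed, while refitting on the test text produces a different vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, for illustration only.
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["a cat barked loudly"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # reuses the same vocabulary

# Refitting on the test documents builds a new, different vocabulary,
# so the resulting matrix no longer lines up with the training features.
X_wrong = TfidfVectorizer().fit_transform(test_docs)

print(X_train.shape[1], X_test.shape[1], X_wrong.shape[1])
```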
Changing the column names also worked for me. I have a prediction model with K-fold validation 👍 |
I also have this issue, but only if I do train-test. If I do cross-validation, it runs without errors. |
@justinrgarrard do you have a reproducible example for your last case? The first couple of cases you mention (passing columns out of alignment, passing object dtype) seem like user error. it's on us as users to pass but if calling predict on a dataset with identical order and column-names as the dataset |
^ looks like @justinrgarrard is correct, this seems to be a known issue with at least certain versions of xgboost. from these two issue tickets, it seems upgrading to the latest version of xgboost solves it in some but possibly not all cases? see a few references to hack-fixes like justin's and others. |
@MaxPowerWasTaken glad to hear that progress has been made on that front. Might be worth keeping an ear out for any future mentions of similar circumstances. |
You can convert the DataFrame to a NumPy array with .as_matrix() (removed in pandas 1.0; use .to_numpy() or .values on newer versions). |
Changing it into a NumPy array solved this issue for me. I think the cause may be that the original train matrix was a NumPy array but the data you now pass is a pandas DataFrame. |
I had a very similar issue with ELI5 and xgboost, but the column order of my train and test sets is identical and I have no NULL or NaN values in my dataset. I solved the issue by following @Bingohong's advice, please see my code below:
This may also be related to: TeamHG-Memex/eli5#256 |
Same issue with the XGBRegressor sklearn API when trying to predict for a single row of the test dataset. justinrgarrard's advice did not work for me. |
Use the |
Check the exception. What you should see are two arrays. One is the column names of the dataframe you’re passing in and the other is the XGBoost feature names. They should be the same length. If you put them side by side in an Excel spreadsheet you will see that they are not in the same order. My guess is that the XGBoost names were written to a dictionary, so it would be a coincidence if the names in the two arrays were in the same order. The fix is easy. Just reorder your dataframe columns to match the XGBoost names:
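A minimal sketch of such a reordering, not the commenter's original snippet. The feature list here is a hard-coded stand-in; with a fitted `XGBClassifier` on a recent xgboost release it would typically come from `model.get_booster().feature_names`, which is an assumption about your version.

```python
import pandas as pd

# Stand-in for the names stored in the trained model (hypothetical values).
# With a real fitted model this might be model.get_booster().feature_names.
model_feature_names = ["age", "income", "score"]

# Hypothetical test data whose columns are in a different order.
X_test = pd.DataFrame({"score": [0.1, 0.9], "age": [30, 41], "income": [50, 72]})

# Selecting with a list of names returns the columns in exactly that order.
# A KeyError here means a feature is missing, not merely out of order.
X_test_ordered = X_test[model_feature_names]

print(list(X_test_ordered.columns))
```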
|
I had the same error with pandas.DataFrame, XGBClassifier and CalibratedClassifierCV (with the prefit option). A short explanation of what's going on in this case:
Why converting to numpy.ndarray helps: Summing up: |
If the model was trained using DataFrames, then the following may help. When predicting on an entire DataFrame: When making a single prediction from a DataFrame: |
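For the single-prediction case, one thing worth checking (an assumption about what goes wrong, not a confirmed diagnosis) is whether the row is passed as a Series or as a one-row DataFrame. The pandas-only sketch below uses made-up column names to show the difference.

```python
import pandas as pd

# Hypothetical training frame, for illustration only.
train = pd.DataFrame({"f_a": [1.0, 2.0], "f_b": [3.0, 4.0]})

# Single brackets return a Series: the column names become its index
# and the object is one-dimensional.
row_as_series = train.iloc[1]

# Double brackets return a one-row DataFrame that keeps the column names,
# so it has the same layout as the frame used for training.
row_as_frame = train.iloc[[1]]

print(row_as_series.shape, row_as_frame.shape)
```

Passing `row_as_frame` to `predict` keeps the column names attached; `row_as_series` may be interpreted differently depending on the library version.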
I'm experiencing this with feature hashing. The columns are in perfect alignment, but the test set has one fewer feature at the last feature position than the training set, because the feature hasher on the training set never saw this feature. Frankly this is going to be challenging to fix because our training and test pipelines are separate and we're using SVM files since the data set is so large. It would be nice if there were an option to ignore this error. |
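If the SVM-format files are loaded through scikit-learn (an assumption; the comment does not say how they are read), `load_svmlight_file` accepts an `n_features` argument that fixes the column count instead of inferring it from the file. A minimal sketch with made-up in-memory data:

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Hypothetical data in svmlight format with one-based feature indices.
# The training data uses features 1..3; the test data never mentions feature 3.
train_bytes = b"1 1:0.5 3:1.0\n0 2:2.0\n"
test_bytes = b"1 1:0.3\n0 2:1.0\n"

X_train, y_train = load_svmlight_file(BytesIO(train_bytes), zero_based=False)

# Without n_features the width is inferred from the highest index seen.
X_test_inferred, _ = load_svmlight_file(BytesIO(test_bytes), zero_based=False)

# Passing the training width pads the test matrix with an empty last column.
X_test, y_test = load_svmlight_file(
    BytesIO(test_bytes), n_features=X_train.shape[1], zero_based=False
)

print(X_train.shape, X_test_inferred.shape, X_test.shape)
```

This requires the test pipeline to know the training feature count, for example by storing it alongside the model.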
Environment info
Operating System: Windows
Package used (python): pandas df, xgboost, sklearn (XGBClassifier) and ELI5 http://eli5.readthedocs.io/en/latest/libraries/xgboost.html
xgboost version used: 0.6
Hi
I have a trained XGBClassifier in python, and I am trying to call the explain_prediction or show_prediction functions in the ELI5 package. This function allows you to interpret the prediction for a given set of features.
The classifier was trained on a pandas dataframe and the predict (sklearn) and show_weights (ELI5) functions work perfectly. I can also interrogate the trees using booster.get_dump() with no issues.
Has anyone got any ideas on how to resolve this?
This may be related to #1441 but I am generating predictions with no error.
Thank you
R
CODE BELOW***
#train is the training data frame (pandas)
#best_gbm.gbm is a trained XGBClassifier on the train dataset
#I have also passed feature_names = predictors but this throws the same error
show_prediction(best_gbm.gbm, train.iloc[1])
Traceback (most recent call last):
File "", line 1, in
show_prediction(best_gbm.gbm, train.iloc[1])
File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\eli5-0.4.2-py3.6.egg\eli5\ipython.py", line 242, in show_prediction
expl = explain_prediction(estimator, doc, **explain_kwargs)
File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\Lib\site-packages\singledispatch.py", line 210, in wrapper
return dispatch(args[0].__class__)(*args, **kw)
File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\Lib\site-packages\singledispatch.py", line 210, in wrapper
return dispatch(args[0].__class__)(*args, **kw)
File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\eli5-0.4.2-py3.6.egg\eli5\xgboost.py", line 168, in explain_prediction_xgboost
proba = predict_proba(xgb, X)
File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\eli5-0.4.2-py3.6.egg\eli5\sklearn\utils.py", line 49, in predict_proba
proba, = clf.predict_proba(X)
File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\xgboost-0.6-py3.6.egg\xgboost\sklearn.py", line 496, in predict_proba
ntree_limit=ntree_limit)
File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\xgboost-0.6-py3.6.egg\xgboost\core.py", line 950, in predict
File "C:\Users\935404\AppData\Local\Continuum\Anaconda3\lib\site-packages\xgboost-0.6-py3.6.egg\xgboost\core.py", line 1193, in _validate_features
raise ValueError(msg.format(self.feature_names,
ValueError: feature_names mismatch:
sensitive info removed
training data did not have the following fields: f73, f40, f66, f147, f62, f39, f2, f83, f127, f84, f54, f97, f114, f102, f49, f7, f8, f56, f23, f107, f138, f28, f71, f152, f80, f57, f46, f58, f139, f121, f140, f20, f45, f113, f5, f60, f135, f101, f68, f76, f65, f41, f99, f131, f109, f117, f13, f100, f128, f52, f15, f50, f95, f124, f19, f12, f43, f137, f33, f22, f32, f72, f142, f151, f74, f90, f48, f122, f133, f26, f79, f94, f18, f10, f51, f0, f53, f92, f29, f115, f143, f14, f116, f47, f69, f82, f34, f89, f35, f6, f132, f16, f118, f31, f96, f59, f75, f1, f110, f61, f108, f25, f21, f11, f17, f85, f150, f3, f98, f24, f77, f103, f112, f91, f144, f70, f86, f119, f55, f130, f106, f44, f36, f64, f67, f4, f145, f37, f126, f88, f93, f104, f81, f149, f27, f136, f146, f30, f38, f42, f141, f134, f120, f105, f129, f9, f148, f87, f125, f123, f111, f78, f63