feature_names mismatch on sparse matrices #1441

inexxt · 2016-08-05T17:07:26Z

Hi,

I'm having some problems with CSR sparse matrices. I train the model on a dataset created by sklearn's TfidfVectorizer, then use the same vectorizer to transform the test dataset.
During prediction, the following error occurs:

<ipython-input-48-bb14ec4ec14c> in <module>()
      2 
      3 for cat, classifier in tqdm(classifiers.items()):
----> 4     pred[cat] = classifier.predict_proba(words)

/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/sklearn.py in predict_proba(self, data, output_margin, ntree_limit)
    475         class_probs = self.booster().predict(test_dmatrix,
    476                                              output_margin=output_margin,
--> 477                                              ntree_limit=ntree_limit)
    478         if self.objective == "multi:softprob":
    479             return class_probs

/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf)
    937             option_mask |= 0x02
    938 
--> 939         self._validate_features(data)
    940 
    941         length = ctypes.c_ulong()

/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/core.py in _validate_features(self, data)
   1177 
   1178                 raise ValueError(msg.format(self.feature_names,
-> 1179                                             data.feature_names))
   1180 
   1181     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):


ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', ...
f38732', 'f38733', 'f38734', 'f38735', 'f38736', 'f38737', 'f38738', 'f38739'] []
expected f4057, f36350, f1683, f1914, f33121, f16637, f21443, f10995, f36221, f24340, f15968, f7863, f38732, ...
f19897, f33500, f37792, f30259, f20094, f27943, f5788, f14369, f9074 in input data

The dimensions are the same at training and prediction time, and the second list appears to be a permutation of the first (the ... symbol is mine; the log is very long).

It may be related to #1238
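
One way to double-check that the dimensions really agree is to compare the declared column count with the highest column index that actually stores a value. This is a minimal sketch on toy SciPy CSR matrices; the names `X_train` and `X_test` and the data are assumptions, not the reporter's actual matrices.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins for the TF-IDF output; the last column of X_test is all zeros.
X_train = csr_matrix(np.array([[0.0, 1.0, 0.5], [0.2, 0.0, 0.3]]))
X_test = csr_matrix(np.array([[0.0, 0.7, 0.0]]))

# The declared shape keeps the full column count even when trailing columns are empty.
same_width = X_train.shape[1] == X_test.shape[1]

# Largest column index that actually holds a stored value in the test matrix.
max_used_col = int(X_test.indices.max()) if X_test.nnz else -1

print(same_width, X_test.shape[1], max_used_col + 1)
```

If a consumer inferred the width from the stored entries rather than from `shape`, the two numbers could differ. Whether that is what happens here is not established in this thread.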


bahshetsian · 2016-08-07T08:40:20Z

You need to convert the sparse matrix to a dense array, like this:

xg_train = xgb.DMatrix(X_train.toarray(), label=y_train)
xg_test = xgb.DMatrix(X_test.toarray())

dmcgarry · 2016-09-01T16:19:38Z

@SpLin12 No, that would be ridiculous. Converting a sparse array to be dense is not an intended fix nor is it possible in the majority of sparse feature spaces.
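
To put a rough number on that objection, here is a back-of-the-envelope estimate of the dense size. The feature count comes from the error message above (f0 through f38739); the row count and the float32 storage are hypothetical assumptions, since the thread does not state them.

```python
# Rough estimate of the memory a dense copy of the feature matrix would need.
n_features = 38740      # f0 .. f38739, taken from the error message
n_rows = 100_000        # hypothetical row count, not stated in the thread
bytes_per_value = 4     # assuming float32 storage

dense_bytes = n_rows * n_features * bytes_per_value
dense_gib = dense_bytes / 2**30

print(f"dense matrix: about {dense_gib:.1f} GiB")
```

A sparse representation only stores the non-zero entries, so for TF-IDF data its footprint is typically a small fraction of this.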

inexxt · 2016-09-02T04:12:09Z

It's definitely #1238; the code works fine with CSC matrices (although the performance is ~20% worse).
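
A minimal sketch of the CSC workaround, assuming the vectorizer output is a SciPy CSR matrix (the variable names and toy data are illustrative, not taken from the reporter's code):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for the transformed test set.
X_test_csr = csr_matrix(np.array([[0.0, 0.7, 0.0], [0.1, 0.0, 0.0]]))

# Convert to compressed sparse column format; values and shape are unchanged.
X_test_csc = X_test_csr.tocsc()

# The CSC matrix would then be passed in place of the CSR one, e.g.
# classifier.predict_proba(X_test_csc), with the training data converted the same way.
print(X_test_csc.format, X_test_csc.shape)
```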

nazirmubbashir · 2016-11-23T09:12:26Z

Passing a DataFrame solved the issue for me.

jpbm · 2017-03-01T01:20:52Z

@nazirmubbashir could you clarify?

nazirmubbashir · 2017-03-01T05:50:31Z

@jpbm I used pandas dataframes and passed them directly, without converting to arrays.
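
The comment does not say how the DataFrames were built. One possible sketch, assuming a pandas version that provides `DataFrame.sparse.from_spmatrix`, gives the training and test frames the same explicit column names. The column names and toy data here are hypothetical, and whether a given xgboost version accepts sparse-typed DataFrames is also an assumption.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-ins for the vectorizer output.
X_train = csr_matrix(np.array([[0.0, 1.0, 0.5], [0.2, 0.0, 0.3]]))
X_test = csr_matrix(np.array([[0.0, 0.7, 0.0]]))

# Hypothetical column names; in practice they could come from the vectorizer's vocabulary.
columns = [f"f{i}" for i in range(X_train.shape[1])]

# Build sparse-backed DataFrames that share the same column order.
df_train = pd.DataFrame.sparse.from_spmatrix(X_train, columns=columns)
df_test = pd.DataFrame.sparse.from_spmatrix(X_test, columns=columns)

print(list(df_train.columns) == list(df_test.columns))
```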

robinmohseni mentioned this issue May 22, 2017

feature_names mismatch when using xgboost + sklearn (XGBClassifier) + eli5(explain_prediction) #2334

Closed

tqchen closed this as completed Jul 4, 2018

lock bot locked as resolved and limited conversation to collaborators Oct 24, 2018
