feature_names mismatch on sparse matrices #1441

Closed
inexxt opened this issue Aug 5, 2016 · 6 comments

Comments

@inexxt

inexxt commented Aug 5, 2016

Hi,

I'm having some problems with CSR sparse matrices. I train the model on a dataset created by sklearn's TfidfVectorizer, then use the same vectorizer to transform the test dataset.
During prediction, the following error occurs:

<ipython-input-48-bb14ec4ec14c> in <module>()
      2 
      3 for cat, classifier in tqdm(classifiers.items()):
----> 4     pred[cat] = classifier.predict_proba(words)

/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/sklearn.py in predict_proba(self, data, output_margin, ntree_limit)
    475         class_probs = self.booster().predict(test_dmatrix,
    476                                              output_margin=output_margin,
--> 477                                              ntree_limit=ntree_limit)
    478         if self.objective == "multi:softprob":
    479             return class_probs

/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf)
    937             option_mask |= 0x02
    938 
--> 939         self._validate_features(data)
    940 
    941         length = ctypes.c_ulong()

/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/core.py in _validate_features(self, data)
   1177 
   1178                 raise ValueError(msg.format(self.feature_names,
-> 1179                                             data.feature_names))
   1180 
   1181     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):


ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', ...
f38732', 'f38733', 'f38734', 'f38735', 'f38736', 'f38737', 'f38738', 'f38739'] []
expected f4057, f36350, f1683, f1914, f33121, f16637, f21443, f10995, f36221, f24340, f15968, f7863, f38732, ...
f19897, f33500, f37792, f30259, f20094, f27943, f5788, f14369, f9074 in input data

The dimensions are the same at training and prediction time, and the second list appears to be a permutation of the first (the ... symbol is mine; the log is very long).

It may be related to #1238
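
For readers trying to reproduce this, the setup described above can be sketched roughly as follows. The toy documents and variable names are assumptions made for illustration, not the data from this report; the sketch only shows that reusing one fitted TfidfVectorizer yields CSR matrices with the same number of columns for train and test.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for the real train/test text
train_docs = ["red apple", "green apple"]
test_docs = ["blue apple"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # reuses the same vocabulary

# Both results are SciPy sparse matrices with one column per vocabulary term
same_width = X_train.shape[1] == X_test.shape[1]
print(X_train.format, X_test.format, same_width)
```

If the column counts already match, as reported here, the mismatch is not caused by the vectorizer itself.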

@bahshetsian

bahshetsian commented Aug 7, 2016

You need to convert the sparse matrix to a dense array, like this:

xg_train = xgb.DMatrix(X_train.toarray(), label=y_train)
xg_test = xgb.DMatrix(X_test.toarray())

@dmcgarry

dmcgarry commented Sep 1, 2016

@SpLin12 No, that would be ridiculous. Converting a sparse matrix to a dense one is not an intended fix, nor is it feasible for the majority of sparse feature spaces.

@inexxt
Author

inexxt commented Sep 2, 2016

It's definitely #1238. The code works fine with CSC matrices (although the performance is ~20% worse).
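
A minimal sketch of the CSC workaround mentioned above, assuming the inputs are SciPy sparse matrices. The helper name `to_csc` and the toy matrix are hypothetical; the conversion stays sparse, so it avoids the memory cost of a dense array.

```python
import numpy as np
from scipy import sparse

def to_csc(matrix):
    """Return the given SciPy sparse matrix in CSC format (still sparse)."""
    return sparse.csc_matrix(matrix)

# Toy CSR matrix standing in for the TfidfVectorizer output
X_csr = sparse.csr_matrix(np.array([[0.0, 1.5, 0.0],
                                    [2.0, 0.0, 0.0]]))
X_csc = to_csc(X_csr)

# Same shape and same stored values, only the storage layout differs
print(X_csc.format, X_csc.shape, X_csc.nnz)
```

In the situation from this thread, the same conversion would be applied to both the training matrix and the matrix passed to `predict_proba`, for example `classifier.predict_proba(to_csc(words))`.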

@nazirmubbashir

Passing a DataFrame solved the issue for me.

@jpbm

jpbm commented Mar 1, 2017

@nazirmubbashir could you clarify?

@nazirmubbashir

@jpbm I used pandas dataframes and passed them directly, without converting to arrays.
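
A rough sketch of the DataFrame approach, under the assumption that xgboost takes its feature names from the DataFrame columns, so the train and test frames need the same columns in the same order. The helper `align_columns` and the toy frames are hypothetical. Note that an ordinary DataFrame is dense, so it carries the same memory cost as the dense array discussed earlier in the thread.

```python
import pandas as pd

def align_columns(train_df, test_df):
    """Reorder test_df columns to match train_df; fail if any are missing."""
    missing = [c for c in train_df.columns if c not in test_df.columns]
    if missing:
        raise ValueError(f"test data lacks columns: {missing}")
    return test_df[list(train_df.columns)]

# Toy frames: same columns, different order
train_df = pd.DataFrame({"f0": [0.0, 1.0], "f1": [2.0, 0.0]})
test_df = pd.DataFrame({"f1": [3.0], "f0": [0.5]})

aligned = align_columns(train_df, test_df)
print(list(aligned.columns))
```

The aligned frame can then be passed directly to the classifier's `predict_proba` in place of a converted array.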
