feature_names mismatch while using sparse matrices in Python #1238
Comments
It seems that this works only if the sparse matrix is CSC. It doesn't work for CSR or COO matrices as it did in earlier versions.
@sinhrks: For me, that's not "random". I frequently train XGBoost on highly sparse data (and it's awesome! It normally beats all other models, and by a pretty wide margin). Then, once I've got the trained model running in production, I'll, of course, want to make predictions on a new piece of incoming data. That data, of course, is highly likely to be sparse and not have a value for whatever column happens to be the last column. So XGBoost now frequently breaks for me, and I've found myself switching to other (less accurate) models, simply because they've got better support for sparse data.
Does anyone know exactly why this error now arises and how to address it? This is a pain point for me, as my existing scripts are failing.
I am giving xgboost a try as part of an sklearn pipeline and ran into the same issue. Is there a workaround until it's fixed?
Yes, when you call predict, use the toarray() method of the sparse matrix. It is terribly inefficient with memory, but it is workable for small slices.
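A minimal sketch of this workaround, assuming scipy is available; the model object and its predict call are illustrative placeholders, the point is only the toarray() densification:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sparse test rows whose last column happens to be all zeros.
X_test = csr_matrix(np.array([[1.0, 0.0, 0.0],
                              [0.0, 2.0, 0.0]]))

# Densify before predicting so the trailing all-zero column is preserved:
X_dense = X_test.toarray()  # shape (2, 3), last column kept

# preds = model.predict(X_dense)  # `model` is a hypothetical fitted estimator
```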
For some reason the error does not occur if I save and load the trained model:
@bryan-woods I was able to find a better workaround. Including this transformer in my sklearn pipeline right before xgboost worked:

```python
from sklearn.base import TransformerMixin

class CSCTransformer(TransformerMixin):
    """Convert input to CSC format so xgboost sees the full column count."""
    def transform(self, X, y=None, **fit_params):
        return X.tocsc()

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}
```
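The heavy lifting in that transformer is just the tocsc() call; a quick sanity check, assuming scipy is installed, that the conversion changes the storage format while preserving the data:

```python
from scipy.sparse import csr_matrix

X = csr_matrix([[1.0, 0.0], [0.0, 3.0]])
X_csc = X.tocsc()

# The format changes, the contents do not:
assert X.format == "csr"
assert X_csc.format == "csc"
assert (X.toarray() == X_csc.toarray()).all()
```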
Neither CSC format nor adding non-zero entries to the last column fixes the issue in the most recent version of xgboost. Reverting back to version 0.4a30 is the only way I can make it work. Consider the following tweak (with a reproducible seed) on the original example:
Same issue here; something definitely got broken in the last release. I did not have this issue before with the same dataset and processing. I might be wrong, but it looks like there are currently no unit tests with sparse CSR arrays in Python using the sklearn API. Would it be possible to add @dmcgarry's example above to the tests?
I tried working around it using .toarray() with CSR sparse arrays, but something is seriously broken. If I load a saved model and try to use it to make predictions with .toarray(), I don't get an error message, but the results are incorrect. I rolled back to 0.4a30 and it works fine. I haven't had the time to chase down the root cause, but it's not good.
The problem occurs because DMatrix.num_col() only returns the number of non-zero columns in a sparse matrix. Hence, if both the train and test data have the same number of non-zero columns, everything works fine.
I don't know yet where the best spot to fix that is, though.
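As a toy illustration of the behavior just described (this is not xgboost's actual code, just a sketch of the inference rule): deriving the column count from the largest non-zero column index silently drops trailing all-zero columns.

```python
# Toy sketch: infer a sparse matrix's width from its largest non-zero index.
def inferred_num_col(rows):
    """rows: list of {col_index: value} dicts, one dict per sparse row."""
    return 1 + max((j for row in rows for j in row), default=-1)

train = [{0: 1.0, 4: 2.0}]   # really 5 columns, and the last is non-zero
test  = [{0: 3.0}]           # same 5 columns, but columns 1-4 are all zero

assert inferred_num_col(train) == 5
assert inferred_num_col(test) == 1   # width mismatch -> feature_names error
```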
I'm also afraid that there is a fundamental problem with sparse matrix handling, because of what @bryan-woods reported: say we have x zero columns both in train and in test, but at different indices. There will be no error, because "feature_names(self)" returns the same feature list for both sets, but the predictions might be wrong, due to non-matching non-zero column indices between train and test.
Has anyone worked on this issue? Does anyone at least have a unit test developed that we could use to develop against? |
I have not worked on it, but @dmcgarry's example above could be used as the beginning of a unit test, I think.
I created a couple of new sparse array tests in my fork of the repo, for those who are interested; they can be run from the root directory of the checkout. You'll notice that both tests fail. This at least provides a test to develop against.
I'm also experiencing this issue, but I'm not able to figure out the best way to work around it until it is finally fixed in the library.
move to #1583 |
You could add a feature with the maximum feature index to your feature list, such as maxid:0.
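A sketch of that trick for libsvm-formatted rows; the helper name and the feature count below are made up for illustration:

```python
max_id = 1000  # illustrative: the true total number of features

def pad_libsvm_row(row, max_id):
    # Append a zero-valued feature at the maximum column index so the
    # matrix width is never under-inferred from the non-zero entries.
    return f"{row} {max_id}:0"

padded = pad_libsvm_row("1 3:0.5 17:2.0", max_id)
print(padded)  # 1 3:0.5 17:2.0 1000:0
```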
passing a dataframe solved the issue for me |
how can I revert to version 0.4? |
pip install --upgrade xgboost==0.4a30 |
None of the sparse matrix types worked for me (I'm working with tf-idf data). I had to revert to the previous version. Thanks for the tip!
Huh. Seems bizarre that my 0.6 version has the XGDMatrixCreateFromCSR instruction, which doesn't take the shape, instead of XGDMatrixCreateFromCSREx.
Can someone please answer this question? I reverted back to version 0.4 and it now seems to work, but I'm afraid it may not be working properly, because I'm still using really sparse matrices.
@l3link nothing bizarre about it: version numbers (or pypi packages) are sometimes not updated for a long time. E.g., the https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/VERSION file as of today was last changed on Jul 29th, and the latest pypi package https://pypi.python.org/pypi/xgboost/ is dated Aug 9th, while the fix was submitted on Sep 23rd in #1606. Please check out the latest code from github.
I had this problem when I used pandas |
I too got rid of this error after converting the dataframe to an array.
Reordering the columns in the test set into the same order as in the train set fixed this for me. I did:
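The comment's code was not captured in this copy of the thread; a minimal pandas sketch of that realignment, assuming the train and test DataFrames share the same column set (names below are illustrative):

```python
import pandas as pd

train_df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
test_df = pd.DataFrame({"c": [30], "a": [10], "b": [20]})

# Realign the test columns to the train column order before predicting:
test_df = test_df[train_df.columns]
print(list(test_df.columns))  # ['a', 'b', 'c']
```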
I tried @warpuv's solution and it worked. My data is too large to load into memory to reorder the columns, though.
Converting the train/test CSR matrices to CSC worked for me.
Converting to CSC also worked for me. My original sparse matrices, the output of sklearn's tf-idf vectorizer, were in CSR format.
Is there any fix yet?
Just built the latest version (0.7.post3) in python3 and I can confirm that this issue still exists. After adapting @dmcgarry's example above, I am still seeing issues with both CSR and CSC matrices.
The above code resulted in the following output:
Please help.
Why has this issue been closed? |
I came across this problem twice recently. In one case, I simply changed the input dataframe into an array and it worked. In the second, I had to realign the column names of the test dataframe using test_df = test_df[train_df.columns]. In both cases, train_df and test_df had exactly the same column names.
I guess I don't understand your comment @CathyQian, are those
@CathyQian xgboost relies on the order of columns, and that is not related to this issue. @ewellinger WRT your example: a model trained on data with 10 features should not accept data with 500 features for prediction, hence the error is thrown. Also, creating DMatrices from all of your matrices and inspecting their num_col and num_row produces expected results. The current state of "sparsity issues" is:
@warpuv it works for me, thanks a lot. |
Had the same error with dense matrices (xgboost v0.6 from the latest anaconda).
As of 0.8, this still doesn't exist, right?
@khotilov #3553 fixed this issue.
@MonsieurWave For this feature, a small pull request to dmlc-core should do the trick. Let me look at it. |
@hcho3 Thanks a lot. For now, I circumvent this issue by making the first line of my libsvm file not so sparse, i.e., saving even the columns with value 0.
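A sketch of that workaround, with an illustrative helper name: write the first libsvm row densely, including the zero-valued features, so the file pins the true column count, while the remaining rows can stay sparse.

```python
def dense_libsvm_row(label, values):
    # Emit every feature index, even zeros, so the row spans all columns.
    return str(label) + " " + " ".join(f"{j}:{v}" for j, v in enumerate(values))

row = dense_libsvm_row(1, [0.5, 0.0, 0.0, 2.0])
print(row)  # 1 0:0.5 1:0.0 2:0.0 3:2.0
```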
I'm getting ValueError: feature_names mismatch while training xgboost with sparse matrices in Python.
The xgboost version is the latest from git; older versions don't give this error. The error is returned at prediction time.
Code:
Full traceback: