feature_names mismatch while using sparse matrices in Python #1238

Closed

abhishekkrthakur opened this issue May 31, 2016 · 51 comments

@abhishekkrthakur commented May 31, 2016

I'm getting ValueError: feature_names mismatch while training xgboost with sparse matrices in Python.
The xgboost version is the latest from git; older versions don't give this error. The error is raised at prediction time.

Code:

from scipy import sparse
import xgboost as xgb
from random import randint

# Random binary label vector of length n
randBinList = lambda n: [randint(0, 1) for _ in range(n)]

train = sparse.rand(100, 500)   # COO sparse matrices
test = sparse.rand(10, 500)
y = randBinList(100)

clf = xgb.XGBClassifier()
clf.fit(train, y)
preds = clf.predict_proba(test)   # raises ValueError on affected versions

Full traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-e03f10289bf1> in <module>()
----> 1 preds = clf.predict_proba(test)

/usr/local/lib/python2.7/dist-packages/xgboost-0.4-py2.7.egg/xgboost/sklearn.pyc in predict_proba(self, data, output_margin, ntree_limit)
    471         class_probs = self.booster().predict(test_dmatrix,
    472                                              output_margin=output_margin,
--> 473                                              ntree_limit=ntree_limit)
    474         if self.objective == "multi:softprob":
    475             return class_probs

/usr/local/lib/python2.7/dist-packages/xgboost-0.4-py2.7.egg/xgboost/core.pyc in predict(self, data, output_margin, ntree_limit, pred_leaf)
    937             option_mask |= 0x02
    938 
--> 939         self._validate_features(data)
    940 
    941         length = ctypes.c_ulong()

/usr/local/lib/python2.7/dist-packages/xgboost-0.4-py2.7.egg/xgboost/core.pyc in _validate_features(self, data)
   1177 
   1178                 raise ValueError(msg.format(self.feature_names,
-> 1179                                             data.feature_names))
   1180 
   1181     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', ..., 'f497', 'f498'] ['f0', 'f1', 'f2', ..., 'f498', 'f499']
training data did not have the following fields: f499

@abhishekkrthakur (Author)

It seems that this works only if the sparse matrix is CSC. Unlike earlier versions, it doesn't work for CSR or COO matrices.

@sinhrks (Contributor) commented Jun 18, 2016

Isn't this a random issue that occurs when the right-most columns are all 0 or 1? Maybe the same as #1091 and #1221.

@ClimbsRocks (Contributor)

@sinhrks: For me, that's not "random". I frequently train XGBoost on highly sparse data (and it's awesome! It normally beats all other models, and by a pretty wide margin).

Then, once I've got the trained model running in production, I'll want to make predictions on new incoming data. That data is highly likely to be sparse and to lack a value for whatever column happens to be the last one. So XGBoost now frequently breaks for me, and I've found myself switching to other (less accurate) models simply because they have better support for sparse data.

@bryan-woods (Contributor)

Does anyone know exactly why this error now arises and how to address it? This is a pain point for me, as my existing scripts are failing.

@EntilZha

I am giving xgboost a try as part of an sklearn pipeline and ran into the same issue. Is there a workaround until it's fixed?

@bryan-woods (Contributor)

Yes, when you call predict, use the toarray() method of the sparse matrix. It is terribly inefficient with memory, but workable with small slices.
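
A minimal sketch of that workaround (clf and test as in the original example; densify only a small slice to keep memory in check):

# Convert a slice of the sparse test matrix to a dense ndarray before
# predicting, which sidesteps the feature_names check on affected versions.
preds = clf.predict_proba(test[:10].toarray())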


@warpuv commented Aug 31, 2016

For some reason the error does not occur if I save and reload the trained model:

    bst = xgb.train(param, dtrain, num_round)

    # predict is not working without this code
    bst.save_model(model_file_name)
    bst = xgb.Booster(param)
    bst.load_model(model_file_name)

    preds = bst.predict(dtest)

@EntilZha commented Aug 31, 2016

@bryan-woods I was able to find a better workaround with tocsc(). There is probably some performance penalty, but nothing nearly as bad as converting to a dense matrix.

Including this in my sklearn pipeline right before xgboost worked:

from sklearn.base import TransformerMixin

class CSCTransformer(TransformerMixin):
    """Convert sparse input to CSC format as a pipeline step."""
    def transform(self, X, y=None, **fit_params):
        return X.tocsc()

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}
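
A hypothetical usage sketch (the step names and X_train, y_train, X_test are assumptions, not from the original comment):

from sklearn.pipeline import Pipeline
import xgboost as xgb

pipe = Pipeline([
    ('to_csc', CSCTransformer()),   # ensure XGBoost always sees CSC input
    ('clf', xgb.XGBClassifier()),
])
pipe.fit(X_train, y_train)          # X_train: a scipy.sparse CSR matrix
probs = pipe.predict_proba(X_test)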

@dmcgarry commented Sep 1, 2016

Neither CSC format nor adding non-zero entries to the last column fixes the issue in the most recent version of xgboost. Reverting to version 0.4a30 is the only thing that makes it work for me. Consider the following tweak (with a reproducible seed) on the original example:

>>> import xgboost as xgb
>>> import numpy as np
>>> from scipy import sparse
>>> 
>>> np.random.seed(10)
>>> X = sparse.rand(100,10).tocsr()
>>> test = sparse.rand(10, 500).tocsr()
>>> y = np.random.randint(2,size=100)
>>> 
>>> clf = xgb.XGBClassifier()
>>> clf.fit(X,y)
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
>>> 
>>> try:
...     pred = clf.predict_proba(test)
...     print "Works when csr with version %s" %xgb.__version__
... except ValueError:
...     "Broken when csr with version %s" %xgb.__version__
... 
'Broken when csr with version 0.6'
>>> try:
...     pred = clf.predict_proba(test.tocsc())
...     print "Works when csc with version %s" %xgb.__version__
... except ValueError:
...     "Still broken when csc with version %s" %xgb.__version__
... 
'Still broken when csc with version 0.6'
>>> try:
...     test[0,(test.shape[1]-1)] = 1.0
...     pred = clf.predict_proba(test)
...     print "Works when adding non-zero entries to last column with version %s" %xgb.__version__
... except ValueError:
...     "Still broken when adding non-zero entries to last column with version %s" %xgb.__version__
... 
/home/david.mcgarry/.conda/envs/ml/lib/python2.7/site-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
'Still broken when adding non-zero entries to last column with version 0.6'
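
The same session, after reverting to version 0.4a30: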
>>> import xgboost as xgb
>>> import numpy as np
>>> from scipy import sparse
>>> 
>>> np.random.seed(10)
>>> X = sparse.rand(100,10).tocsr()
>>> test = sparse.rand(10, 500).tocsr()
>>> y = np.random.randint(2,size=100)
>>> 
>>> clf = xgb.XGBClassifier()
>>> clf.fit(X,y)
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
>>> 
>>> try:
...     pred = clf.predict_proba(test)
...     print "Works when csr with version %s" %xgb.__version__
... except ValueError:
...     "Broken when csr with version %s" %xgb.__version__
... 
Works when csr with version 0.4
>>> try:
...     pred = clf.predict_proba(test.tocsc())
...     print "Works when csc with version %s" %xgb.__version__
... except ValueError:
...     "Still broken when csc with version %s" %xgb.__version__
... 
Works when csc with version 0.4
>>> try:
...     test[0,(test.shape[1]-1)] = 1.0
...     pred = clf.predict_proba(test)
...     print "Works when adding non-zero entries to last column with version %s" %xgb.__version__
... except ValueError:
...     "Still broken when adding non-zero entries to last column with version %s" %xgb.__version__
... 
/Users/david.mcgarry/anaconda/envs/ml/lib/python2.7/site-packages/scipy/sparse/compressed.py:739: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
Works when adding non-zero entries to last column with version 0.4

@rth commented Sep 1, 2016

Same issue here; something definitely got broken in the last release. I did not have this issue before with the same dataset and processing. I might be wrong, but it looks like there are currently no unit tests with sparse CSR arrays in Python using the sklearn API. Would it be possible to add @dmcgarry's example above to tests/python/tests_with_sklearn.py?

@bryan-woods (Contributor)

I tried working around it using .toarray() with CSR sparse arrays, but something is seriously broken. If I load a saved model and use it to make predictions with .toarray(), I don't get an error message, but the results are incorrect. I rolled back to 0.4a30 and it works fine. I haven't had time to chase down the root cause, but it's not good.

@Far0n (Contributor) commented Sep 2, 2016

The problem occurs because DMatrix.num_col() returns only the number of non-zero columns in a sparse matrix. Hence, if both the train and test data have the same number of non-zero columns, everything works fine.
Otherwise, you end up with two different feature-name lists, because the validation function calls:

    @property
    def feature_names(self):
        """Get feature names (column labels).

        Returns
        -------
        feature_names : list or None
        """
        if self._feature_names is None:
            return ['f{0}'.format(i) for i in range(self.num_col())]
        else:
            return self._feature_names

self._feature_names is None for sparse matrices, and because self.num_col() returns only the number of non-zero columns, the validation fails as soon as the number of non-zero columns in the to-be-predicted data differs from the number of non-zero columns in the training data.

I don't know yet where the best spot to fix that is, though.
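
A minimal sketch illustrating the behavior described above (as reported for affected 0.6 builds; the exact count depends on which right-most columns happen to be all-zero):

import scipy.sparse
import xgboost as xgb

m = scipy.sparse.rand(10, 500, random_state=1).tocsr()
dm = xgb.DMatrix(m)
# On affected builds this can print fewer than 500 when the right-most
# columns contain no non-zero entries.
print(dm.num_col())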

@Far0n (Contributor) commented Sep 2, 2016

I'm also afraid that there is a more fundamental problem with sparse matrix handling, because of what @bryan-woods reported: say we have x zero-columns in both train and test, but at different indices. There will be no error, because feature_names returns the same feature list for both sets, but the predictions might be wrong due to mismatched non-zero column indices between train and test.

@bryan-woods (Contributor)

Has anyone worked on this issue? Does anyone at least have a unit test we could develop against?

@rth commented Sep 16, 2016

I have not worked on it, but @dmcgarry's example above could be used as the beginning of a unit test, I think:

import xgboost as xgb
import numpy as np
import scipy.sparse


def test_xgbclassifier_sklearn_sparse():
    np.random.seed(10)
    X = scipy.sparse.rand(100,10).tocsr()
    test = scipy.sparse.rand(10, 500).tocsr()
    y = np.random.randint(2,size=100)

    clf = xgb.XGBClassifier()
    clf.fit(X,y)
    pred = clf.predict_proba(test)

@bryan-woods (Contributor)

I created a couple of new sparse-array tests in my fork of the repo. For those who are interested:
https://github.com/bryan-woods/xgboost/blob/sparse_test/tests/python/test_scipy_sparse.py

To run the tests from the root directory of the checkout:
python -m nose tests/python/test_scipy_sparse.py

You'll notice that both tests fail. This at least will provide a test to develop against.

@vallettea

I'm also experiencing this issue, but I'm not able to figure out the best way to work around it until it's finally fixed in the library.

@tqchen (Member) commented Sep 17, 2016

Moved to #1583.

@bihujrj commented Nov 8, 2016

You could add a feature with the maximum feature index to your feature list, such as maxid:0; see the example below.
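
For instance, if the data has 500 columns (maximum feature index 499), each libsvm line could carry an explicit zero-valued entry at that index (values here are illustrative):

1 3:0.5 17:1.2 499:0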

@nazirmubbashir

Passing a DataFrame solved the issue for me.

@dfernandez22

How can I revert to version 0.4?

@dmcgarry

pip install --upgrade xgboost==0.4a30

@ad-owens

None of the sparse matrix types worked for me (I'm working with tf-idf data). I had to revert to the previous version. Thanks for the tip!

@l3link commented Dec 8, 2016

Huh,

In [64]: xgboost.__version__
Out[64]: '0.6'

Signature: matrix._init_from_csr(csr)
Source:
    def _init_from_csr(self, csr):
        """
        Initialize data from a CSR matrix.
        """
        if len(csr.indices) != len(csr.data):
            raise ValueError('length mismatch: {} vs {}'.format(len(csr.indices), len(csr.data)))
        self.handle = ctypes.c_void_p()
        _check_call(_LIB.XGDMatrixCreateFromCSR(c_array(ctypes.c_ulong, csr.indptr),
                                                c_array(ctypes.c_uint, csr.indices),
                                                c_array(ctypes.c_float, csr.data),
                                                len(csr.indptr), len(csr.data),
                                                ctypes.byref(self.handle)))
File: ~/anaconda/lib/python2.7/site-packages/xgboost/core.py
Type: instancemethod

It seems bizarre that my 0.6 version has the XGDMatrixCreateFromCSR call, which doesn't take the shape, instead of XGDMatrixCreateFromCSREx.
Is it possible the osx distribution is different?

lopuhin added a commit to TeamHG-Memex/hh-page-classifier that referenced this issue Dec 8, 2016
lopuhin added a commit to TeamHG-Memex/hh-page-classifier that referenced this issue Dec 9, 2016
@ghost commented Dec 12, 2016

I'm also afraid that there is a more fundamental problem with sparse matrix handling, because of what @bryan-woods reported: say we have x zero-columns in both train and test, but at different indices. There will be no error, because feature_names returns the same feature list for both sets, but the predictions might be wrong due to mismatched non-zero column indices between train and test.

Can someone please answer this question? I reverted to version 0.4 and it now seems to work, but I'm afraid it may not be working properly, because I'm still using really sparse matrices.

@khotilov (Member) commented Jan 2, 2017

@l3link nothing bizarre about it: version numbers (and pypi packages) are sometimes not updated for a long time. E.g., the https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/VERSION file as of today was last changed on Jul 29th, and the last pypi package https://pypi.python.org/pypi/xgboost/ is dated Aug 9th, while the fix was submitted on Sep 23rd (#1606). Please check out the latest code from github.

@oleksandrasaskia commented Jan 28, 2017

I had this problem when I used a pandas DataFrame (non-sparse representation).
I converted it to a numpy ndarray via df.as_matrix() and got rid of the error.

@pnandhini11

I too got rid of this error after converting the DataFrame to an array.

@fx86 commented Jun 25, 2017

Reordering the columns in the test set to match the order of the train set fixed this for me.
I used pandas DataFrames; without this, using .as_matrix() raised the same error.

I did:

test = test[train.columns]

@nguyentp

I tried @warpuv's solution and it worked. My data is too large to load into memory and reorder the columns.

@bdod6 commented Nov 2, 2017

Converting train/test csr matrices to csc worked for me

Xtrain = scipy.sparse.csc_matrix(Xtrain)

@mrgloom commented Dec 15, 2017

Converting to csc_matrix works, tested on 0.6a2:

    X_train = scipy.sparse.csc_matrix(X_train)
    X_test = scipy.sparse.csc_matrix(X_test)

    xgb_train = xgb.DMatrix(X_train, label=y_train)
    xgb_test = xgb.DMatrix(X_test, label=y_test)

Types before conversion:
    type(X_train)    <class 'scipy.sparse.csr.csr_matrix'>
    type(X_test)     <class 'scipy.sparse.csr.csr_matrix'>

Types after conversion:
    type(X_train)    <class 'scipy.sparse.csc.csc_matrix'>
    type(X_test)     <class 'scipy.sparse.csc.csc_matrix'>
    type(xgb_train)  <class 'xgboost.core.DMatrix'>
    type(xgb_test)   <class 'xgboost.core.DMatrix'>

My original sparse matrices were the output of sklearn's tf-idf vectorizer, in csr_matrix format.

@pallavbakshi

Is there any fix yet?

@ewellinger

I just built the latest version (0.7.post3) in Python 3 and I can confirm that this issue still exists. After adapting @dmcgarry's example above, I am still seeing issues with both csr_matrix and csc_matrix:

import xgboost as xgb
import numpy as np
from scipy import sparse

np.random.seed(10)

X_csr = sparse.rand(100, 10).tocsr()
test_csr = sparse.rand(10, 500).tocsr()

X_csc = sparse.rand(100, 10).tocsc()
test_csc = sparse.rand(10, 500).tocsc()

y = np.random.randint(2, size=100)

clf_csr = xgb.XGBClassifier()
clf_csr.fit(X_csr, y)

clf_csc = xgb.XGBClassifier()
clf_csc.fit(X_csc, y)

# Try with csr
try:
    pred = clf_csr.predict_proba(test_csr)
    print("Works when csr with version %s" % xgb.__version__)
except ValueError:
    print("Broken when csr with version %s" % xgb.__version__)

try:
    test_csr[0, (test_csr.shape[1] - 1)] = 1.0
    pred = clf_csr.predict_proba(test_csr)
    print("Works when adding non-zero entries to last column with version %s" % xgb.__version__)
except ValueError:
    print("Still broken when adding non-zero entries to last column with version %s" % xgb.__version__)

# Try with csc
try:
    pred = clf_csc.predict_proba(test_csc)
    print("Works when csc with version %s" % xgb.__version__)
except ValueError:
    print("Broken when csc with version %s" % xgb.__version__)

try:
    test_csc[0, (test_csc.shape[1] - 1)] = 1.0
    pred = clf_csc.predict_proba(test_csc)
    print("Works when adding non-zero entries to last column with version %s" % xgb.__version__)
except ValueError:
    print("Still broken when adding non-zero entries to last column with version %s" % xgb.__version__)

The above code resulted in the following output:

Broken when csr with version 0.7
Still broken when adding non-zero entries to last column with version 0.7
Broken when csc with version 0.7
Still broken when adding non-zero entries to last column with version 0.7

@hhristov94

pls help

@ewellinger

Why has this issue been closed?

@CathyQian

I came across this problem twice recently. In one case, I simply changed the input DataFrame into an array and it worked. In the second, I had to realign the column names of the test DataFrame using test_df = test_df[train_df.columns]. In both cases, train_df and test_df have exactly the same column names.

@ewellinger

I guess I don't understand your comment, @CathyQian. Are those train_df/test_df sparse? Also, which version of xgboost were you running when you ran into these issues?

@khotilov (Member) commented Mar 2, 2018

@CathyQian xgboost relies on the order of columns, and that is not related to this issue.

@ewellinger WRT your example: a model trained on data with 10 features should not accept data with 500 features for prediction, hence the error is thrown. Also, creating DMatrices from all of your matrices and inspecting their num_col and num_row produces expected results, as in the sketch below.
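
A quick sketch of that check, reusing X_csr and test_csr from the example above:

import xgboost as xgb

dtrain = xgb.DMatrix(X_csr)                  # trained-on matrix: 100 x 10
dtest = xgb.DMatrix(test_csr)                # prediction matrix: 10 x 500
print(dtrain.num_row(), dtrain.num_col())    # expected: 100 10
print(dtest.num_row(), dtest.num_col())      # expected: 10 500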

The current state of "sparsity issues" is:

  • DMatrix creation from CSR and its use in a model should work correctly. This issue was closed since that was its subject.
  • DMatrix creation from CSC produces an object with correct dimensions, but it might give incorrect results during training or prediction when the last rows are fully sparse (#2630, "Handling of NAs/missings - sparse and dense matrices giving different results"). I didn't yet have time to properly fix that part.
  • A parameter to specify a predefined number of columns when loading libsvm data into DMatrix has not been implemented yet. Volunteers to contribute are welcome.

@rainness

@warpuv it works for me, thanks a lot.

@ag95v2 commented Jun 1, 2018

I had the same error with dense matrices (xgboost v0.6 from the latest anaconda).
The error occurred when I ran multiple regressions on different feature subsets of the training sample.
Creating a new model instance before fitting each regression fixed the problem; a minimal sketch follows.
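
A minimal sketch of that fix (feature_subsets, X_train, and y_train are assumed names, not from the original comment):

import xgboost as xgb

models = []
for cols in feature_subsets:
    reg = xgb.XGBRegressor()            # fresh instance for each subset
    reg.fit(X_train[:, cols], y_train)
    models.append(reg)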

@JulianKlug

  • A parameter to specify a predefined number of columns when loading libsvm data into DMatrix has not been implemented yet. Volunteers to contribute are welcome.

As of 0.8, this still doesn't exist, right?

@hcho3 (Collaborator) commented Oct 4, 2018

DMatrix creation from CSC produces an object with correct dimensions, but it might give incorrect results during training or prediction when the last rows are fully sparse (#2630). I didn't yet have time to properly fix that part.

@khotilov #3553 fixed this issue.

A parameter to specify a predefined number of columns when loading libsvm data into DMatrix has not been implemented yet. Volunteers to contribute are welcome.

@MonsieurWave For this feature, a small pull request to dmlc-core should do the trick. Let me look at it.

@JulianKlug commented Oct 4, 2018

@hcho3 Thanks a lot.

For now, I circumvent this issue by making the first line of my libsvm file not so sparse, i.e., saving even the columns with value 0; see the example below.
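
For instance, assuming only 5 columns, the first line would record every index explicitly (values are illustrative):

0 0:0 1:0.3 2:0 3:1.5 4:0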

The lock bot locked this issue as resolved and limited conversation to collaborators on Jan 2, 2019.