
pip version 0.6 does not support sparse feature vectors #1456

Closed
WladimirSidorenko opened this issue Aug 10, 2016 · 2 comments
@WladimirSidorenko

Problem Description

In contrast to version 0.4, the pip release 0.6 of the package does not support sparse feature vectors. This makes the new version backward-incompatible with the previous one, so you should either roll this change back or increase the major version number.

0.4

import xgboost
xgboost.__version__
# '0.4'

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

model = Pipeline([("vect", DictVectorizer()), ("clf", xgboost.XGBClassifier())])

x = [{"feat{:d}".format(x_i): 1} for x_i in xrange(10)]
y = [y_i for y_i in reversed(xrange(10))]

model.fit(x, y)
# Pipeline(steps=[('vect', DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sort=True,
#        sparse=True)), ('clf', XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
#       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
#       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
#       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
#       scale_pos_weight=1, seed=0, silent=True, subsample=1))])
for x_i in x:
    model.predict_proba(x_i)
# array([[ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1]], dtype=float32)
# array([[ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1]], dtype=float32)
# array([[ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1]], dtype=float32)
# array([[ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1]], dtype=float32)
# array([[ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1]], dtype=float32)
# array([[ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1]], dtype=float32)
# array([[ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1]], dtype=float32)
# array([[ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1]], dtype=float32)
# array([[ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1]], dtype=float32)
# array([[ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1]], dtype=float32)

0.6

import xgboost
xgboost.__version__
# '0.6'

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

model = Pipeline([("vect", DictVectorizer()), ("clf", xgboost.XGBClassifier())])

x = [{"feat{:d}".format(x_i): 1} for x_i in xrange(10)]
y = [y_i for y_i in reversed(xrange(10))]

model.fit(x, y)
# Pipeline(steps=[('vect', DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sort=True,
#        sparse=True)), ('clf', XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
#       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
#       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
#       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
#       scale_pos_weight=1, seed=0, silent=True, subsample=1))])
for x_i in x:
    model.predict_proba(x_i)
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/sidorenko/Projects/DiscourseSenser/venv/local/lib/python2.7/site-packages/sklearn/utils/metaestimators.py", line 37, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "/home/sidorenko/Projects/DiscourseSenser/venv/local/lib/python2.7/site-packages/sklearn/pipeline.py", line 240, in predict_proba
    return self.steps[-1][-1].predict_proba(Xt)
  File "/home/sidorenko/Projects/DiscourseSenser/venv/local/lib/python2.7/site-packages/xgboost/sklearn.py", line 477, in predict_proba
    ntree_limit=ntree_limit)
  File "/home/sidorenko/Projects/DiscourseSenser/venv/local/lib/python2.7/site-packages/xgboost/core.py", line 939, in predict
    self._validate_features(data)
  File "/home/sidorenko/Projects/DiscourseSenser/venv/local/lib/python2.7/site-packages/xgboost/core.py", line 1179, in _validate_features
    data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9'] ['f0']
expected f1, f2, f3, f4, f5, f6, f7, f8, f9 in input data
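
One workaround sketch (an editor's assumption, not part of the original report): if the mismatch comes from single-row sparse CSR input, making the vectorizer emit dense arrays sidesteps the sparse handling entirely, since every row then carries all columns explicitly. `sparse=False` is a documented `DictVectorizer` parameter:

```python
from sklearn.feature_extraction import DictVectorizer

# sparse=False makes DictVectorizer return dense numpy arrays, so each
# row carries all 10 columns explicitly instead of a 1-nonzero CSR row.
vect = DictVectorizer(sparse=False)
x = [{"feat{:d}".format(i): 1} for i in range(10)]
X = vect.fit_transform(x)
print(X.shape)  # (10, 10)
```

Whether this restores the original pipeline's behavior under 0.6 is untested here; it trades memory for compatibility, since the dense matrix materializes every zero.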

System Specification

uname -srm
Linux 3.19.0-42-generic x86_64

pip --version
pip 8.1.1

python --version
Python 2.7.6
@phunterlau
Contributor

@WladimirSidorenko This is a change in xgboost 0.6 itself, not in the pip installation. Please refer to #1238. In short, from @abhishekkrthakur:

It seems that this works only if the sparse matrix is CSC. It doesn't work for CSR or COO matrices as it did in earlier versions.

However, if you want, you can always install a previous version with pip install xgboost==0.4a30, since I haven't hidden that release, in case someone still prefers the 0.4 version.
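
A minimal sketch of the CSR-to-CSC conversion with scipy (illustrative only; that CSC satisfies 0.6's feature validation is per the observation quoted above, not verified here):

```python
import scipy.sparse as sp

# A single-row CSR matrix like DictVectorizer emits for {"feat0": 1}
# against a fitted 10-feature vocabulary: shape (1, 10), one nonzero.
row = sp.csr_matrix(([1.0], ([0], [0])), shape=(1, 10))

# tocsc() keeps the explicit 10-column shape; CSC is the format the
# quoted comment reports as working with xgboost 0.6.
row_csc = row.tocsc()
print(row.shape, row_csc.shape)  # (1, 10) (1, 10)
```

In a pipeline, this conversion would have to happen between the vectorizer's transform and the classifier's predict, e.g. via a small transformer step that calls tocsc() on its input.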

@WladimirSidorenko
Author

Duplicate of #1238.

lock bot locked as resolved and limited conversation to collaborators on Oct 26, 2018