feature_names mismatch while using sparse matrices in Python #1238
Comments
It seems that this works only if the sparse matrix is CSC. It doesn't work for CSR or COO matrices as it did in earlier versions.
@sinhrks: For me, that's not "random". I frequently train XGBoost on highly sparse data (and it's awesome! It normally beats all other models, and by a pretty wide margin). Then, once I've got the trained model running in production, I'll, of course, want to make predictions on a new piece of incoming data. That data, of course, is highly likely to be sparse and not have a value for whatever column happens to be the last column. So XGBoost now frequently breaks for me, and I've found myself switching to other (less accurate) models, simply because they've got better support for sparse data.
Does anyone know exactly why this error now arises and how to address it? This is a pain point for me, as my existing scripts are failing.
I am giving xgboost a try as part of an sklearn pipeline and ran into the same issue. Is there a workaround until it's fixed?
Yes, when you call predict, use the toarray() method of the sparse matrix. It is terribly inefficient with memory, but it is workable for small slices.
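A minimal sketch of this workaround, assuming scipy is available; the model object and its predict call are illustrative placeholders, the point is only the toarray() densification:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sparse test rows whose last column happens to be all zeros.
X_test = csr_matrix(np.array([[1.0, 0.0, 0.0],
                              [0.0, 2.0, 0.0]]))

# Densify before predicting so the trailing all-zero column is preserved:
X_dense = X_test.toarray()  # shape (2, 3), last column kept

# preds = model.predict(X_dense)  # `model` is a hypothetical fitted estimator
```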
For some reason the error does not occur if I save and load the trained model:
@bryan-woods I was able to find a better workaround. Including this transformer in my sklearn pipeline right before xgboost worked:

```python
from sklearn.base import TransformerMixin

class CSCTransformer(TransformerMixin):
    """Convert input to CSC format so xgboost sees the full column count."""
    def transform(self, X, y=None, **fit_params):
        return X.tocsc()

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}
```
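The heavy lifting in that transformer is just the tocsc() call; a quick sanity check, assuming scipy is installed, that the conversion changes the storage format while preserving the data:

```python
from scipy.sparse import csr_matrix

X = csr_matrix([[1.0, 0.0], [0.0, 3.0]])
X_csc = X.tocsc()

# The format changes, the contents do not:
assert X.format == "csr"
assert X_csc.format == "csc"
assert (X.toarray() == X_csc.toarray()).all()
```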
Neither CSC format nor adding non-zero entries to the last column fixes the issue in the most recent version of xgboost. Reverting back to version 0.4a30 is the only way I can make it work. Consider the following tweak (with a reproducible seed) on the original example:
Same issue here; something definitely got broken in the last release. I did not have this issue before with the same dataset and processing. I might be wrong, but it looks like there are currently no unit tests with sparse CSR arrays in Python using the sklearn API. Would it be possible to add @dmcgarry's example above to the tests?
I tried working around it using .toarray() with CSR sparse arrays, but something is seriously broken. If I load a saved model and try to use it to make predictions with .toarray(), I don't get an error message, but the results are incorrect. I rolled back to 0.4a30 and it works fine. I haven't had the time to chase down the root cause, but it's not good.
The problem occurs because DMatrix.num_col() only returns the number of non-zero columns in a sparse matrix. Hence, if both the train and test data have the same number of non-zero columns, everything works fine.
I don't know yet where the best spot to fix that is, though.
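As a toy illustration of the behavior just described (this is not xgboost's actual code, just a sketch of the inference rule): deriving the column count from the largest non-zero column index silently drops trailing all-zero columns.

```python
# Toy sketch: infer a sparse matrix's width from its largest non-zero index.
def inferred_num_col(rows):
    """rows: list of {col_index: value} dicts, one dict per sparse row."""
    return 1 + max((j for row in rows for j in row), default=-1)

train = [{0: 1.0, 4: 2.0}]   # really 5 columns, and the last is non-zero
test  = [{0: 3.0}]           # same 5 columns, but columns 1-4 are all zero

assert inferred_num_col(train) == 5
assert inferred_num_col(test) == 1   # width mismatch -> feature_names error
```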
I'm also afraid that there is a fundamental problem with sparse matrix handling, because of what @bryan-woods reported: say we have x zero columns both in train and in test, but at different indices. There will be no error, because "feature_names(self)" returns the same feature list for both sets, but the predictions might be wrong, due to non-matching non-zero column indices between train and test.
Has anyone worked on this issue? Does anyone at least have a unit test developed that we could use to develop against? |
I have not worked on it, but @dmcgarry's example above could be used as the beginning of a unit test, I think.
I created a couple of new sparse array tests in my fork of the repo, for those who are interested; they can be run from the root directory of the checkout. You'll notice that both tests fail. This at least provides a test to develop against.
I'm also experiencing this issue, but I'm not able to figure out the best way to work around it until it is finally fixed in the library.
move to #1583 |
You could add a feature with the maximum feature index to your feature list, such as maxid:0.
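A sketch of that trick for libsvm-formatted rows; the helper name and the feature count below are made up for illustration:

```python
max_id = 1000  # illustrative: the true total number of features

def pad_libsvm_row(row, max_id):
    # Append a zero-valued feature at the maximum column index so the
    # matrix width is never under-inferred from the non-zero entries.
    return f"{row} {max_id}:0"

padded = pad_libsvm_row("1 3:0.5 17:2.0", max_id)
print(padded)  # 1 3:0.5 17:2.0 1000:0
```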
passing a dataframe solved the issue for me |
how can I revert to version 0.4? |
pip install --upgrade xgboost==0.4a30 |
None of the sparse matrix types worked for me (I'm working with tf-idf data). I had to revert to the previous version. Thanks for the tip!
Huh. Seems bizarre that my 0.6 version has the XGDMatrixCreateFromCSR instruction, which doesn't take the shape, instead of XGDMatrixCreateFromCSREx.
Can someone please answer this question? I reverted back to version 0.4 and it now seems to work, but I'm afraid it may not be working properly, because I'm still using really sparse matrices.
@l3link nothing bizarre about it: version numbers (or pypi packages) are sometimes not updated for a long time. E.g., the https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/VERSION file as of today was last changed on Jul 29th, and the latest pypi package https://pypi.python.org/pypi/xgboost/ is dated Aug 9th, while the fix was submitted on Sep 23rd in #1606. Please check out the latest code from github.
I had this problem when I used pandas |
I too got rid of this error after converting the dataframe to an array.
Reordering the columns in the test set into the same order as in the train set fixed this for me. I did:
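The comment's code was not captured in this copy of the thread; a minimal pandas sketch of that realignment, assuming the train and test DataFrames share the same column set (names below are illustrative):

```python
import pandas as pd

train_df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
test_df = pd.DataFrame({"c": [30], "a": [10], "b": [20]})

# Realign the test columns to the train column order before predicting:
test_df = test_df[train_df.columns]
print(list(test_df.columns))  # ['a', 'b', 'c']
```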
I tried @warpuv's solution and it worked. My data is too large to load into memory to reorder the columns, though.
Converting the train/test CSR matrices to CSC worked for me.
Converting to CSC also worked for me. My original sparse matrices, the output of sklearn's tf-idf vectorizer, were in CSR format.
Is there any fix yet?
Just built the latest version (0.7.post3) in python3 and I can confirm that this issue still exists. After adapting @dmcgarry's example above, I am still seeing issues with both CSR and CSC matrices.
The above code resulted in the following output:
Please help.
Why has this issue been closed? |
I came across this problem twice recently. In one case, I simply changed the input dataframe into an array and it worked. In the second, I had to realign the column names of the test dataframe using test_df = test_df[train_df.columns]. In both cases, train_df and test_df had exactly the same column names.
I guess I don't understand your comment @CathyQian, are those
@CathyQian xgboost relies on the order of columns, and that is not related to this issue. @ewellinger WRT your example: a model trained on data with 10 features should not accept data with 500 features for prediction, hence the error is thrown. Also, creating DMatrices from all of your matrices and inspecting their num_col and num_row produces expected results. The current state of "sparsity issues" is:
@warpuv it works for me, thanks a lot. |
Had the same error with dense matrices (xgboost v0.6 from the latest anaconda).
As of 0.8, this still doesn't exist, right?
@khotilov #3553 fixed this issue.
@MonsieurWave For this feature, a small pull request to dmlc-core should do the trick. Let me look at it. |
@hcho3 Thanks a lot. For now, I circumvent this issue by making the first line of my libsvm file not so sparse, i.e., saving even the columns with value 0.
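A sketch of that workaround, with an illustrative helper name: write the first libsvm row densely, including the zero-valued features, so the file pins the true column count, while the remaining rows can stay sparse.

```python
def dense_libsvm_row(label, values):
    # Emit every feature index, even zeros, so the row spans all columns.
    return str(label) + " " + " ".join(f"{j}:{v}" for j, v in enumerate(values))

row = dense_libsvm_row(1, [0.5, 0.0, 0.0, 2.0])
print(row)  # 1 0:0.5 1:0.0 2:0.0 3:2.0
```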
I'm getting ValueError: feature_names mismatch while training xgboost with sparse matrices in Python.
The xgboost version is the latest from git; older versions don't give this error. The error is returned at prediction time.
Code:
Full traceback: