Conflict with PermutationImportance, DataFrame and XGBoost (with workaround) #256

Closed
ianozsvald opened this issue Mar 13, 2018 · 10 comments

@ianozsvald

  • Python 3.5
  • XGBoost '0.7.post3'
  • sklearn '0.19.1'
  • Pandas '0.22.0'
  • ELI5 '0.8'

I'm working on a regression problem in insurance, where XGBoost outperforms sklearn's tree ensembles. ELI5's show_weights works fine with either XGBRegressor or RandomForestRegressor. With PermutationImportance, RandomForestRegressor works but XGBRegressor throws an error. I have a workaround, noted below, but I'm not sure where the problem lies. Any guidance would be happily received.

Estimator:

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=119,
       n_jobs=1, nthread=None, objective='reg:tweedie', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1, tweedie_variance_power=1.9)

X_test_dbg.shape (1759, 3) 
y_test_dbg.shape (1759,)

type(X_test_dbg) # pandas.core.frame.DataFrame

If I train est on the DataFrame version of the data - training, testing and ELI5's show_weights are fine
but PermutationImportance throws a feature-name error:

perm = PermutationImportance(est).fit(X_test_dbg, y_test_dbg)

ValueError: feature_names mismatch: ['myfeature1', 'myfeature2', 'myfeature3'] ['f0', 'f1', 'f2']
expected myfeature1, myfeature3, myfeature2 in input data
training data did not have the following fields: f1, f0, f2

If I change the type of the training data to an ndarray then PermutationImportance no longer throws an error:

X_train_dbg = X_train_dbg.as_matrix()
y_train_dbg = y_train_dbg.as_matrix()
X_test_dbg = X_test_dbg.as_matrix()
y_test_dbg = y_test_dbg.as_matrix()
type(X_train_dbg) # numpy.ndarray for all 4 variables
and then 
est.fit(X_train_dbg, y_train_dbg) 
perm = PermutationImportance(est).fit(X_test_dbg, y_test_dbg)

I'll note that with sklearn's RandomForestRegressor I can use Pandas DataFrames in PermutationImportance without problem; it is only an issue with XGBoost.

My features are two continuous values and one binary indicator; I have no categorical features. This is a smaller set of features from a larger problem (where the problem originated).

I got the clue to try as_matrix in this XGBoost bug report on sparse matrices - this might be a red herring as I'm not using any sparse matrices at all: dmlc/xgboost#1238 (comment)

Possibly related: this ELI5 bug report on Pandas and XGBoost links to code that hits a similar error (In[79], with show_predictions) to mine: #166 https://github.com/nareshshah139/titanic_rebranded/blob/master/ELI5_Example.ipynb However, as @kmike noted, get_dummies is called separately on the train and validation sets in that example, so they may simply not have matching feature names.

@ianozsvald
Author

In addition I confirm that I have the same issue with XGBClassifier (as opposed to the regressor used above) on a different machine. On a Titanic example with:

  • Python 3.6 (not 3.5 from above)
  • xgboost 0.7.post3
  • sklearn 0.19.1
  • pandas 0.22
  • eli5 0.8

In this case I have a classifier. RandomForestClassifier works fine. XGBClassifier works in most of my code, but not with PermutationImportance if I pass a DataFrame. If I use the .as_matrix() workaround then it works ok.

@ianozsvald
Author

I've written some test code so we can easily reproduce this bug (at PyDataLondon 2018 Conference):

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance

%load_ext watermark
%watermark -d -m -v -p numpy,sklearn,eli5,xgboost,pandas

# OUTPUT GENERATED:
2018-04-28 

CPython 3.6.5
IPython 6.3.1

numpy 1.14.2
sklearn 0.19.1
eli5 0.8
xgboost 0.71
pandas 0.22.0

compiler   : GCC 4.8.2 20140120 (Red Hat 4.8.2-15)
system     : Linux
release    : 4.9.91-040991-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit

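This code builds a 30-row X and y (10 rows concatenated three times), converts them to a DataFrame and Series, and splits them into 15-row train and test sets. When we run PermutationImportance on an XGBClassifier fitted to the DataFrame it blows up: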

# 10 items of data, pairs of useless feature and predictive feature
X_np = np.array([[0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 2], [0, 2], [0, 2], [0, 2], [0, 2]])
y_np = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# if we have 10 items (prepared above) XGBClassifier won't fit (but RandomForestClassifier does)
# so the score is 0. If we concatenate to make "more data" (30 items in total) then XGBClassifier
# fits with 100% (as does RandomForestClassifier)
X_np = np.concatenate((X_np, X_np, X_np))
y_np = np.concatenate((y_np, y_np, y_np))

# convert to Pandas DataFrame - this is where the bug starts
X = pd.DataFrame(X_np)
y = pd.Series(y_np)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
print("X y shapes:", X_train.shape, y_train.shape, X_test.shape, y_test.shape) # (15, 2) (15,) (15, 2) (15,)

# for diagnostics - we can swap in RFC and confirm that non-XGBoost works fine in all cases
#est = RandomForestClassifier()
est = XGBClassifier()
est.fit(X_train, y_train)
print("Classifier score (should be 1.0):", est.score(X_test, y_test))

perm = PermutationImportance(est).fit(X_test, y_test)
eli5.show_weights(perm)

OUTPUT:

~/anaconda3/envs/debug_xgb_pandas_eli5/lib/python3.6/site-packages/xgboost/core.py in _validate_features(self, data)
   1306 
   1307                 raise ValueError(msg.format(self.feature_names,
-> 1308                                             data.feature_names))
   1309 
   1310     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['0', '1'] ['f0', 'f1']
expected 0, 1 in input data
training data did not have the following fields: f0, f1
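One way to read this message, offered as an assumption rather than a confirmed diagnosis: the model remembers the DataFrame's column labels ('0', '1') from training, but at scoring time it receives data with no names attached, so the defaults 'f0', 'f1' are compared against them. The sketch below imitates that comparison in plain Python. Both `default_feature_names` and `validate_features` are hypothetical helpers written for illustration; they are not XGBoost or ELI5 functions.

```python
def default_feature_names(n_columns):
    """Names used when the input carries none (the 'f0', 'f1' seen in the error)."""
    return ["f{}".format(i) for i in range(n_columns)]


def validate_features(trained_names, incoming_names):
    """Hypothetical stand-in for a feature-name check; raises on any mismatch."""
    if trained_names != incoming_names:
        raise ValueError(
            "feature_names mismatch: {} {}".format(trained_names, incoming_names)
        )


# Trained on a DataFrame whose integer column labels 0 and 1 became '0' and '1'.
trained_names = ["0", "1"]

# Same names at predict time: the check passes silently.
validate_features(trained_names, ["0", "1"])

# A bare two-column array carries no names, so the defaults are compared: it fails.
try:
    validate_features(trained_names, default_feature_names(2))
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)

print(mismatch_message)
```

If this reading is right, it would also explain why training and scoring on plain arrays works: both sides then use the same default names.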

If we use as_matrix then the code runs and we get a Permutation output:

# convert DataFrames to numpy matrices (2d)
X_train_dbg = X_train.as_matrix()
y_train_dbg = y_train.as_matrix()
X_test_dbg = X_test.as_matrix()
y_test_dbg = y_test.as_matrix()
# type(X_train_dbg) # numpy.ndarray for all 4 variables
print("X y shapes:", X_train_dbg.shape, y_train_dbg.shape, X_test_dbg.shape, y_test_dbg.shape) # (15, 2) (15,) (15, 2) (15,)

est.fit(X_train_dbg, y_train_dbg) 
perm = PermutationImportance(est).fit(X_test_dbg, y_test_dbg)

eli5.show_weights(perm)
# works fine

If we use the original numpy matrices it also works fine:

# use the original numpy arrays
X = X_np
y = y_np

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

print("X y shapes:", X_train.shape, y_train.shape, X_test.shape, y_test.shape) # (15, 2) (15,) (15, 2) (15,)
#X_train # show example

est = XGBClassifier()
est.fit(X_train, y_train)

perm = PermutationImportance(est).fit(X_test, y_test)
eli5.show_weights(perm)

@ianozsvald
Author

The above was posted at https://pydata.org/london2018/ whilst I ran a "Make your first open source contribution" session. Possibly one of my attendees will recreate this bug to confirm that they see the same issue. I'm also hoping that someone has a crack at making two tests:

  • Using numpy (not Pandas) and calling PermutationImportance and having 0 failures
  • Using a DataFrame and calling the same and having 1 failure

@darioka

darioka commented Apr 28, 2018

I recreated the error on my machine (ubuntu 16.04):

CPython 3.6.4
IPython 6.2.1

numpy 1.13.3
sklearn 0.19.1
eli5 0.8
xgboost 0.7.post3
pandas 0.22.0

compiler   : GCC 7.2.0
system     : Linux
release    : 4.13.0-39-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit

This is the output with pandas DataFrame:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-24bb05021e47> in <module>()
     22 print("Classifier score (should be 1.0):", est.score(X_test, y_test))
     23 
---> 24 perm = PermutationImportance(est).fit(X_test, y_test)
     25 eli5.show_weights(perm)

~/anaconda3/envs/unconference/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py in fit(self, X, y, groups, **fit_params)
    188             si = self._cv_scores_importances(X, y, groups=groups, **fit_params)
    189         else:
--> 190             si = self._non_cv_scores_importances(X, y)
    191         scores, results = si
    192         self.scores_ = np.array(scores)

~/anaconda3/envs/unconference/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py in _non_cv_scores_importances(self, X, y)
    212     def _non_cv_scores_importances(self, X, y):
    213         score_func = partial(self.scorer_, self.wrapped_estimator_)
--> 214         base_score, importances = self._get_score_importances(score_func, X, y)
    215         return [base_score] * len(importances), importances
    216 

~/anaconda3/envs/unconference/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py in _get_score_importances(self, score_func, X, y)
    217     def _get_score_importances(self, score_func, X, y):
    218         return get_score_importances(score_func, X, y, n_iter=self.n_iter,
--> 219                                      random_state=self.rng_)
    220 
    221     @property

~/anaconda3/envs/unconference/lib/python3.6/site-packages/eli5/permutation_importance.py in get_score_importances(score_func, X, y, n_iter, columns_to_shuffle, random_state)
     84     """
     85     rng = check_random_state(random_state)
---> 86     base_score = score_func(X, y)
     87     scores_decreases = []
     88     for i in range(n_iter):

~/anaconda3/envs/unconference/lib/python3.6/site-packages/sklearn/metrics/scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
    242 def _passthrough_scorer(estimator, *args, **kwargs):
    243     """Function that wraps estimator.score"""
--> 244     return estimator.score(*args, **kwargs)
    245 
    246 

~/anaconda3/envs/unconference/lib/python3.6/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    347         """
    348         from .metrics import accuracy_score
--> 349         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    350 
    351 

~/anaconda3/envs/unconference/lib/python3.6/site-packages/xgboost/sklearn.py in predict(self, data, output_margin, ntree_limit)
    524         class_probs = self.get_booster().predict(test_dmatrix,
    525                                                  output_margin=output_margin,
--> 526                                                  ntree_limit=ntree_limit)
    527         if len(class_probs.shape) > 1:
    528             column_indexes = np.argmax(class_probs, axis=1)

~/anaconda3/envs/unconference/lib/python3.6/site-packages/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs)
   1042             option_mask |= 0x08
   1043 
-> 1044         self._validate_features(data)
   1045 
   1046         length = c_bst_ulong()

~/anaconda3/envs/unconference/lib/python3.6/site-packages/xgboost/core.py in _validate_features(self, data)
   1286 
   1287                 raise ValueError(msg.format(self.feature_names,
-> 1288                                             data.feature_names))
   1289 
   1290     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['0', '1'] ['f0', 'f1']
expected 0, 1 in input data
training data did not have the following fields: f0, f1

Everything works fine with as_matrix and with the original numpy arrays.

@harrysalmon

harrysalmon commented Apr 28, 2018

Also recreated on mac 10.12.6 outside of ipython.

pandas==0.22.0
scikit-learn==0.19.1
eli5==0.8
xgboost==0.71
numpy==1.14.2

X y shapes: (15, 2) (15,) (15, 2) (15,)
/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
  if diff:
Classifier score (should be 1.0): 1.0
Traceback (most recent call last):
  File "xgboost_eli5.py", line 35, in <module>
    perm = PermutationImportance(est).fit(X_test, y_test)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py", line 190, in fit
    si = self._non_cv_scores_importances(X, y)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py", line 214, in _non_cv_scores_importances
    base_score, importances = self._get_score_importances(score_func, X, y)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py", line 219, in _get_score_importances
    random_state=self.rng_)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/eli5/permutation_importance.py", line 86, in get_score_importances
    base_score = score_func(X, y)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/sklearn/metrics/scorer.py", line 244, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/sklearn/base.py", line 349, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/xgboost/sklearn.py", line 544, in predict
    ntree_limit=ntree_limit)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/xgboost/core.py", line 1050, in predict
    self._validate_features(data)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/xgboost/core.py", line 1308, in _validate_features
    data.feature_names))
ValueError: feature_names mismatch: ['0', '1'] ['f0', 'f1']
expected 1, 0 in input data
training data did not have the following fields: f0, f1

@mm5631

mm5631 commented Apr 30, 2018

Very inconvenient issue, I really hope someone gets time to look into this!

@ianozsvald
Author

ianozsvald commented Aug 12, 2018

By way of an update I still see this bug using a fresh conda environment, watermark reports:

2018-08-12 

CPython 3.6.6
IPython 6.5.0

numpy 1.15.0
matplotlib 2.2.2
sklearn 0.19.1
xgboost 0.72.1
seaborn 0.9.0
pandas 0.23.4
eli5 0.8

In addition using the solution with .as_matrix that I posted above, I now see the following warnings:
ipykernel_launcher.py:55: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.

Swapping .as_matrix() for .values solves this warning.
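For anyone applying the swap, here is a minimal sketch of the conversion using .values. The small frame and its column names are made up for illustration; only the .values attribute itself is the point.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the real train/test frames.
X_df = pd.DataFrame({"myfeature1": [0.1, 0.2, 0.3], "myfeature2": [1, 0, 1]})
y_s = pd.Series([0, 1, 0])

# .values returns the underlying numpy array, dropping column labels and index.
X_arr = X_df.values
y_arr = y_s.values

print(type(X_arr), X_arr.shape, y_arr.shape)
```

The resulting arrays can be passed to est.fit and PermutationImportance(...).fit in place of the .as_matrix() results.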

In addition the lines perm = PermutationImportance(clf).fit(X_test_dbg, y_test_dbg) and perm.fit(X_test, y_test) both generate many repetitions of:
sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.

The above warnings look very similar to this bug in sklearn which is fixed, but isn't yet released (it isn't in the public 0.19.1) so these warnings probably go away when sklearn gets a new release: scikit-learn/scikit-learn#9816

@hofesh
Contributor

hofesh commented Feb 12, 2019

Can confirm this issue is still active

@hofesh
Contributor

hofesh commented Feb 12, 2019

Created a fix PR for this issue here.

Also, in the meantime, there is a better workaround that doesn't require retraining the model: simply pass the scorer below into PermutationImportance at init:

def scorer(model, X, y):
    return model.score(pd.DataFrame(X, columns=X_train.columns), y)

perm = PermutationImportance(est, scoring=scorer).fit(X_test, y_test)
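To see what the scorer is doing without needing XGBoost or ELI5 installed, here is a small sketch. `NameCheckingModel` is a hypothetical stand-in that, like the fitted estimator in this thread, refuses input whose column names differ from those seen in training. The assumption being illustrated is that the scorer receives X as a bare array and restores the names before scoring.

```python
import numpy as np
import pandas as pd


class NameCheckingModel:
    """Hypothetical stand-in: insists on the column names seen during fit."""

    def fit(self, X, y):
        self.columns_ = list(X.columns)
        return self

    def score(self, X, y):
        if not hasattr(X, "columns") or list(X.columns) != self.columns_:
            raise ValueError("feature_names mismatch")
        return 1.0  # dummy score; only the name check matters here


X_train = pd.DataFrame({"a": [0, 0, 0, 0], "b": [1, 1, 2, 2]})
y_train = pd.Series([0, 0, 1, 1])
model = NameCheckingModel().fit(X_train, y_train)


def scorer(model, X, y):
    # Re-attach the training column names before handing X to the model.
    return model.score(pd.DataFrame(X, columns=X_train.columns), y)


X_as_array = X_train.values  # names stripped, as if converted upstream

try:
    model.score(X_as_array, y_train)
    raw_array_ok = True
except ValueError:
    raw_array_ok = False

wrapped_score = scorer(model, X_as_array, y_train)
print(raw_array_ok, wrapped_score)
```

Note that the scorer relies on X_train still being in scope and on the columns arriving in their training order.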

@lopuhin
Contributor

lopuhin commented Mar 4, 2019

Since #296 is merged, I think this can be closed, thanks @hofesh for the fix, @ianozsvald for reporting and diagnosing, and to everyone who contributed. If there is anything not fixed yet, please comment and we'll reopen (or open a new one).

@lopuhin lopuhin closed this as completed Mar 4, 2019