Conflict with PermutationImportance, DataFrame and XGBoost (with workaround) #256

ianozsvald · 2018-03-13T11:57:11Z

Python 3.5
XGBoost '0.7.post3'
sklearn '0.19.1'
Pandas '0.22.0'
ELI5 '0.8'

I'm working on a regression problem in insurance, where XGBoost outperforms sklearn's tree ensembles. ELI5's show_weights works fine with either XGBRegressor or RandomForestRegressor. With PermutationImportance, RandomForestRegressor works but XGBRegressor throws an error. I have a workaround (noted below) but I'm not sure where the problem lies. Any guidance would be happily received.

Estimator:

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=119,
       n_jobs=1, nthread=None, objective='reg:tweedie', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1, tweedie_variance_power=1.9)

X_test_dbg.shape (1759, 3) 
y_test_dbg.shape (1759,)

type(X_test_dbg) # pandas.core.frame.DataFrame

If I train est on the DataFrame version of the data, then training, testing and ELI5's show_weights are all fine, but PermutationImportance throws a feature-name error:

perm = PermutationImportance(est).fit(X_test_dbg, y_test_dbg)

ValueError: feature_names mismatch: ['myfeature1', 'myfeature2', 'myfeature3'] ['f0', 'f1', 'f2']
expected myfeature1, myfeature3, myfeature2 in input data
training data did not have the following fields: f1, f0, f2

If I change the type of the training data to an ndarray then PermutationImportance no longer throws an error:

X_train_dbg = X_train_dbg.as_matrix()
y_train_dbg = y_train_dbg.as_matrix()
X_test_dbg = X_test_dbg.as_matrix()
y_test_dbg = y_test_dbg.as_matrix()
type(X_train_dbg) # numpy.ndarray for all 4 variables
and then 
est.fit(X_train_dbg, y_train_dbg) 
perm = PermutationImportance(est).fit(X_test_dbg, y_test_dbg)

I'll note that with sklearn's RandomForestRegressor I can use Pandas DataFrames in PermutationImportance without problem; it is only an issue with XGBoost.

My features are two continuous values and one binary indicator, I have no categorical features. This is a smaller set of features from a larger problem (where the problem originated).

I got the clue to try as_matrix in this XGBoost bug report on sparse matrices - this might be a red herring as I'm not using any sparse matrices at all: dmlc/xgboost#1238 (comment)

Possibly related: ELI5 bug report #166 on Pandas and XGBoost links to code that has a similar error to mine (In[79] with show_predictions): https://github.com/nareshshah139/titanic_rebranded/blob/master/ELI5_Example.ipynb However, as @kmike noted, get_dummies is called separately on the train and validation sets there, so maybe the feature names simply don't match in that example?


ianozsvald · 2018-03-14T12:21:34Z

In addition I confirm that I have the same issue with XGBClassifier (as opposed to the regressor used above) on a different machine. On a Titanic example with:

Python 3.6 (not 3.5 from above)
xgboost 0.7.post3
sklearn 0.19.1
pandas 0.22
eli5 0.8

In this case I have a classifier. RandomForestClassifier works fine; XGBClassifier works in most of my code but not with PermutationImportance if I pass a DataFrame. If I use the .as_matrix() workaround then it works ok.

ianozsvald · 2018-04-28T09:35:03Z

I've written some test code so we can easily reproduce this bug (at PyDataLondon 2018 Conference):

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance

%load_ext watermark
%watermark -d -m -v -p numpy,sklearn,eli5,xgboost,pandas

# OUTPUT GENERATED:
2018-04-28 

CPython 3.6.5
IPython 6.3.1

numpy 1.14.2
sklearn 0.19.1
eli5 0.8
xgboost 0.71
pandas 0.22.0

compiler   : GCC 4.8.2 20140120 (Red Hat 4.8.2-15)
system     : Linux
release    : 4.9.91-040991-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit

This code builds a 30-row X and y, converts them to a DataFrame and a Series, splits them into 15-row train and test sets and fits an XGBClassifier. When we pass the test set to PermutationImportance it blows up:

# 10 items of data, pairs of useless feature and predictive feature
X_np = np.array([[0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 2], [0, 2], [0, 2], [0, 2], [0, 2]])
y_np = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# if we have only the 10 items prepared above, XGBClassifier won't fit (but RandomForestClassifier does)
# so the score is 0. If we concatenate to make "more data" (30 items in total) then XGBClassifier
# fits with 100% (as does RandomForestClassifier)
X_np = np.concatenate((X_np, X_np, X_np))
y_np = np.concatenate((y_np, y_np, y_np))

# convert to Pandas DataFrame - this is where the bug starts
X = pd.DataFrame(X_np)
y = pd.Series(y_np)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
print("X y shapes:", X_train.shape, y_train.shape, X_test.shape, y_test.shape) # (15, 2) (15,) (15, 2) (15,)

# for diagnostics - we can swap in RFC and confirm that non-XGBoost works fine in all cases
#est = RandomForestClassifier()
est = XGBClassifier()
est.fit(X_train, y_train)
print("Classifier score (should be 1.0):", est.score(X_test, y_test))

perm = PermutationImportance(est).fit(X_test, y_test)
eli5.show_weights(perm)

OUTPUT:

~/anaconda3/envs/debug_xgb_pandas_eli5/lib/python3.6/site-packages/xgboost/core.py in _validate_features(self, data)
   1306 
   1307                 raise ValueError(msg.format(self.feature_names,
-> 1308                                             data.feature_names))
   1309 
   1310     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['0', '1'] ['f0', 'f1']
expected 0, 1 in input data
training data did not have the following fields: f0, f1
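
The two name lists in this message can be reproduced without XGBoost or ELI5. The sketch below assumes, based only on the error text above, that a DataFrame contributes its stringified column labels while a bare ndarray gets default names of the form f0, f1, and so on. It does not call either library and is not a claim about where inside them the conversion happens:

```python
import numpy as np
import pandas as pd

# Same kind of frame as in the reproduction: integer column labels 0 and 1.
X_df = pd.DataFrame(np.array([[0, 1], [0, 2]]))

# Names carried by the DataFrame: its column labels as strings.
names_from_frame = [str(c) for c in X_df.columns]

# Names assumed for a bare array (pattern taken from the error message).
X_arr = X_df.values
names_from_array = ["f{}".format(i) for i in range(X_arr.shape[1])]

print(names_from_frame)                      # ['0', '1']
print(names_from_array)                      # ['f0', 'f1']
print(names_from_frame == names_from_array)  # False
```

If the model is fitted on the frame and later scored on an array of the same data, the two lists differ even though the values are identical, which matches the mismatch reported here.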

If we use as_matrix then the code runs and we get a PermutationImportance output:

# convert DataFrames to numpy matrices (2d)
X_train_dbg = X_train.as_matrix()
y_train_dbg = y_train.as_matrix()
X_test_dbg = X_test.as_matrix()
y_test_dbg = y_test.as_matrix()
# type(X_train_dbg) # numpy.ndarray for all 4 variables
print("X y shapes:", X_train_dbg.shape, y_train_dbg.shape, X_test_dbg.shape, y_test_dbg.shape) # (15, 2) (15,) (15, 2) (15,)

est.fit(X_train_dbg, y_train_dbg) 
perm = PermutationImportance(est).fit(X_test_dbg, y_test_dbg)

eli5.show_weights(perm)
# works fine

If we use the original numpy arrays it also works fine:

# use the original numpy arrays
X = X_np
y = y_np

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

print("X y shapes:", X_train.shape, y_train.shape, X_test.shape, y_test.shape) # (15, 2) (15,) (15, 2) (15,)
#X_train # show example

est = XGBClassifier()
est.fit(X_train, y_train)

perm = PermutationImportance(est).fit(X_test, y_test)
eli5.show_weights(perm)

ianozsvald · 2018-04-28T09:49:23Z

The above was posted at https://pydata.org/london2018/ whilst I ran a "Make your first open source contribution" session. Possibly one of my attendees will recreate this bug to confirm that they see the same issue. I'm also hoping that someone has a crack at making two tests:

Using numpy (not Pandas) and calling PermutationImportance and having 0 failures
Using a DataFrame and calling the same and having 1 failure

darioka · 2018-04-28T09:58:58Z

I recreated the error on my machine (ubuntu 16.04):

CPython 3.6.4
IPython 6.2.1

numpy 1.13.3
sklearn 0.19.1
eli5 0.8
xgboost 0.7.post3
pandas 0.22.0

compiler   : GCC 7.2.0
system     : Linux
release    : 4.13.0-39-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit

This is the output with pandas DataFrame:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-24bb05021e47> in <module>()
     22 print("Classifier score (should be 1.0):", est.score(X_test, y_test))
     23 
---> 24 perm = PermutationImportance(est).fit(X_test, y_test)
     25 eli5.show_weights(perm)

~/anaconda3/envs/unconference/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py in fit(self, X, y, groups, **fit_params)
    188             si = self._cv_scores_importances(X, y, groups=groups, **fit_params)
    189         else:
--> 190             si = self._non_cv_scores_importances(X, y)
    191         scores, results = si
    192         self.scores_ = np.array(scores)

~/anaconda3/envs/unconference/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py in _non_cv_scores_importances(self, X, y)
    212     def _non_cv_scores_importances(self, X, y):
    213         score_func = partial(self.scorer_, self.wrapped_estimator_)
--> 214         base_score, importances = self._get_score_importances(score_func, X, y)
    215         return [base_score] * len(importances), importances
    216 

~/anaconda3/envs/unconference/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py in _get_score_importances(self, score_func, X, y)
    217     def _get_score_importances(self, score_func, X, y):
    218         return get_score_importances(score_func, X, y, n_iter=self.n_iter,
--> 219                                      random_state=self.rng_)
    220 
    221     @property

~/anaconda3/envs/unconference/lib/python3.6/site-packages/eli5/permutation_importance.py in get_score_importances(score_func, X, y, n_iter, columns_to_shuffle, random_state)
     84     """
     85     rng = check_random_state(random_state)
---> 86     base_score = score_func(X, y)
     87     scores_decreases = []
     88     for i in range(n_iter):

~/anaconda3/envs/unconference/lib/python3.6/site-packages/sklearn/metrics/scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
    242 def _passthrough_scorer(estimator, *args, **kwargs):
    243     """Function that wraps estimator.score"""
--> 244     return estimator.score(*args, **kwargs)
    245 
    246 

~/anaconda3/envs/unconference/lib/python3.6/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    347         """
    348         from .metrics import accuracy_score
--> 349         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    350 
    351 

~/anaconda3/envs/unconference/lib/python3.6/site-packages/xgboost/sklearn.py in predict(self, data, output_margin, ntree_limit)
    524         class_probs = self.get_booster().predict(test_dmatrix,
    525                                                  output_margin=output_margin,
--> 526                                                  ntree_limit=ntree_limit)
    527         if len(class_probs.shape) > 1:
    528             column_indexes = np.argmax(class_probs, axis=1)

~/anaconda3/envs/unconference/lib/python3.6/site-packages/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs)
   1042             option_mask |= 0x08
   1043 
-> 1044         self._validate_features(data)
   1045 
   1046         length = c_bst_ulong()

~/anaconda3/envs/unconference/lib/python3.6/site-packages/xgboost/core.py in _validate_features(self, data)
   1286 
   1287                 raise ValueError(msg.format(self.feature_names,
-> 1288                                             data.feature_names))
   1289 
   1290     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['0', '1'] ['f0', 'f1']
expected 0, 1 in input data
training data did not have the following fields: f0, f1

Everything works fine with as_matrix and with the original numpy arrays.

harrysalmon · 2018-04-28T10:13:42Z

Also recreated on mac 10.12.6 outside of ipython.

pandas==0.22.0
scikit-learn==0.19.1
eli5==0.8
xgboost==0.71
numpy==1.14.2

X y shapes: (15, 2) (15,) (15, 2) (15,)
/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
  if diff:
Classifier score (should be 1.0): 1.0
Traceback (most recent call last):
  File "xgboost_eli5.py", line 35, in <module>
    perm = PermutationImportance(est).fit(X_test, y_test)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py", line 190, in fit
    si = self._non_cv_scores_importances(X, y)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py", line 214, in _non_cv_scores_importances
    base_score, importances = self._get_score_importances(score_func, X, y)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/eli5/sklearn/permutation_importance.py", line 219, in _get_score_importances
    random_state=self.rng_)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/eli5/permutation_importance.py", line 86, in get_score_importances
    base_score = score_func(X, y)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/sklearn/metrics/scorer.py", line 244, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/sklearn/base.py", line 349, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/xgboost/sklearn.py", line 544, in predict
    ntree_limit=ntree_limit)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/xgboost/core.py", line 1050, in predict
    self._validate_features(data)
  File "/Users/harrysalmon/anaconda/envs/temp/lib/python3.6/site-packages/xgboost/core.py", line 1308, in _validate_features
    data.feature_names))
ValueError: feature_names mismatch: ['0', '1'] ['f0', 'f1']
expected 1, 0 in input data
training data did not have the following fields: f0, f1

mm5631 · 2018-04-30T09:50:00Z

Very inconvenient issue, I really hope someone gets time to look into this!

ianozsvald · 2018-08-12T17:45:00Z

By way of an update I still see this bug using a fresh conda environment, watermark reports:

2018-08-12 

CPython 3.6.6
IPython 6.5.0

numpy 1.15.0
matplotlib 2.2.2
sklearn 0.19.1
xgboost 0.72.1
seaborn 0.9.0
pandas 0.23.4
eli5 0.8

In addition using the solution with .as_matrix that I posted above, I now see the following warnings:
ipykernel_launcher.py:55: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.

Swapping .as_matrix() for .values solves this warning.
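
A minimal sketch of that swap, using a small made-up frame (the column names and values are placeholders, not my real data). In newer pandas releases .to_numpy() is another option, but .values is what I used here:

```python
import numpy as np
import pandas as pd

# Placeholder data standing in for the real test set.
X_test = pd.DataFrame({
    "myfeature1": [0.5, 1.5],
    "myfeature2": [2.0, 3.0],
    "myfeature3": [0, 1],
})
y_test = pd.Series([10.0, 20.0])

# .values returns the underlying numpy array without the FutureWarning
# that .as_matrix() raises.
X_test_dbg = X_test.values
y_test_dbg = y_test.values

print(type(X_test_dbg), X_test_dbg.shape)
print(type(y_test_dbg), y_test_dbg.shape)
```

The estimator still has to be fitted on arrays as well, as in the original workaround, so that training and scoring see the same (default) feature names.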

In addition the lines perm = PermutationImportance(clf).fit(X_test_dbg, y_test_dbg) and perm.fit(X_test, y_test) both generate many repetitions of:
sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.

The above warnings look very similar to this bug in sklearn which is fixed, but isn't yet released (it isn't in the public 0.19.1) so these warnings probably go away when sklearn gets a new release: scikit-learn/scikit-learn#9816

hofesh · 2019-02-12T07:41:10Z

Can confirm this issue is still active

hofesh · 2019-02-12T11:07:45Z

Created a fix PR for this issue: #296.

Also, in the meantime, there is a better workaround that doesn't require retraining the model. Simply pass the scorer below into PermutationImportance at init:

def scorer(model, X, y):
    return model.score(pd.DataFrame(X, columns=X_train.columns), y)

perm = PermutationImportance(est, scoring=scorer).fit(X_test, y_test)
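
To see why the wrapper helps without installing XGBoost or ELI5, here is a rough sketch with a hypothetical stand-in model. NameCheckingModel is invented for illustration only: it remembers the column names seen at fit time and refuses to score input whose names differ, which is the behaviour the error message suggests. The scorer is the same one as above:

```python
import numpy as np
import pandas as pd


class NameCheckingModel:
    """Hypothetical stand-in for an estimator that validates feature names."""

    def fit(self, X, y):
        # Remember the column names and a trivial majority-class prediction.
        self.feature_names_ = [str(c) for c in X.columns]
        self.majority_ = int(round(float(np.mean(y))))
        return self

    def score(self, X, y):
        # A bare ndarray has no .columns, so it yields an empty name list.
        names = [str(c) for c in getattr(X, "columns", [])]
        if names != self.feature_names_:
            raise ValueError("feature_names mismatch: {} {}".format(
                self.feature_names_, names))
        return float(np.mean(np.asarray(y) == self.majority_))


X_train = pd.DataFrame({"a": [0, 0, 0], "b": [1, 2, 2]})
y_train = pd.Series([0, 1, 1])
model = NameCheckingModel().fit(X_train, y_train)


def scorer(model, X, y):
    # Re-attach the training column names before scoring.
    return model.score(pd.DataFrame(X, columns=X_train.columns), y)


X_bare = X_train.values  # what the model sees if the names are dropped

try:
    model.score(X_bare, y_train)
    bare_ok = True
except ValueError:
    bare_ok = False

wrapped_score = scorer(model, X_bare, y_train)
print(bare_ok, wrapped_score)
```

Scoring the bare array fails, while the wrapped call succeeds because the frame is rebuilt with the training columns. Note that the scorer relies on X_train still being in scope when it is called.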

lopuhin · 2019-03-04T14:31:24Z

Since #296 is merged, I think this can be closed, thanks @hofesh for the fix, @ianozsvald for reporting and diagnosing, and to everyone who contributed. If there is anything not fixed yet, please comment and we'll reopen (or open a new one).

harrysalmon mentioned this issue Apr 28, 2018

Test PermutationImportance with XGBClassifier and pd.DataFrame issue#256 #261

Open

natskr mentioned this issue May 1, 2018

feature_names mismatch when using xgboost + sklearn (XGBClassifier) + eli5(explain_prediction) dmlc/xgboost#2334

Closed

ianozsvald mentioned this issue Aug 14, 2018

XGboost Feature mismatch EpistasisLab/tpot#738

Closed

hofesh mentioned this issue Feb 12, 2019

Support XGBClassifier with pandas DataFrame in PermutationImportance #296

Merged

lopuhin closed this as completed Mar 4, 2019
