
Problem transformation does not work with XGBoost #88

Closed
souravsingh opened this issue Dec 11, 2017 · 6 comments
@souravsingh
I am facing a problem using XGBoost for multi-label classification. Is this an expected behavior?

@ChristianSch
Member

Please post your relevant code so that we can see what you're trying to achieve.

@souravsingh
Author

Here is the code I am using:

import pandas as pd
import numpy as np
from sklearn.svm import SVC
#from sklearn.multioutput import ClassifierChain
from sklearn.naive_bayes import GaussianNB
from modlamp.sequences import MixedLibrary
from sklearn.model_selection import train_test_split
from skmultilearn.problem_transform import BinaryRelevance, LabelPowerset, ClassifierChain
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score
import xgboost as xgb
from mlxtend.classifier import StackingClassifier
data = pd.read_csv("full_dataset.csv")
y = data[['antiviral','antifungal','antibacterial']]
to_drop = ['# ID','Sequence','antiviral', 'antibacterial', 'antifungal']
X = data.drop(to_drop,axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
clf = LabelPowerset(xgb.XGBClassifier(n_estimators=500))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_pred)
print("The macro averaged F1-score is: %.3f" %(f1_score(y_pred, y_test, average='macro')))

On running the program, the output is:

Traceback (most recent call last):
  File "ml-prog.py", line 28, in <module>
    y_pred = clf.predict(X_test)
  File "/usr/local/lib/python2.7/dist-packages/skmultilearn/problem_transform/lp.py", line 71, in predict
    lp_prediction = self.classifier.predict(self.ensure_input_format(X))
  File "/usr/local/lib/python2.7/dist-packages/xgboost/sklearn.py", line 465, in predict
    ntree_limit=ntree_limit)
  File "/usr/local/lib/python2.7/dist-packages/xgboost/core.py", line 939, in predict
    self._validate_features(data)
  File "/usr/local/lib/python2.7/dist-packages/xgboost/core.py", line 1179, in _validate_features
    data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13'] [u'Length', u'BomanIndex', u'Aromaticity', u'AliphaticIndex', u'InstabilityIndex', u'Charge', u'MW', u'H_Eisenberg', u'uH_Eisenberg', u'H_GRAVY', u'uH_GRAVY', u'Z3_1', u'Z3_2', u'Z3_3']
expected f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f12, f13, f10, f11 in input data
training data did not have the following fields: Charge, BomanIndex, H_Eisenberg, Z3_1, Z3_2, H_GRAVY, uH_Eisenberg, MW, AliphaticIndex, Length, uH_GRAVY, Z3_3, Aromaticity, InstabilityIndex
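The mismatch arises because XGBoost records the DataFrame's column names as `feature_names` at fit time, while scikit-multilearn passes a plain matrix at predict time, for which XGBoost autogenerates names `f0`, `f1`, …. A minimal sketch of that difference, using only pandas and NumPy (the DataFrame contents here are a stand-in, not the real dataset):

```python
import numpy as np
import pandas as pd

# A DataFrame carries column names, which XGBoost records as feature_names.
df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  columns=['Length', 'MW', 'Charge'])
print(list(df.columns))  # ['Length', 'MW', 'Charge']

# Converting to a plain array drops the names; XGBoost would then
# autogenerate 'f0', 'f1', 'f2' -- a different set, hence the
# feature_names mismatch at predict time.
arr = df.values
print(arr.shape)  # (2, 3)
```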

@souravsingh
Author

The same error is obtained when using BinaryRelevance as a classifier.

@ChristianSch
Member

ChristianSch commented Dec 11, 2017

I think the problem is on your side, likely something with your data. The following proof of concept runs flawlessly:

import pandas as pd
import numpy as np
from sklearn.svm import SVC
#from sklearn.multioutput import ClassifierChain
from sklearn.naive_bayes import GaussianNB
from modlamp.sequences import MixedLibrary
from sklearn.model_selection import train_test_split
from skmultilearn.problem_transform import BinaryRelevance, LabelPowerset, ClassifierChain
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score
import xgboost as xgb
from mlxtend.classifier import StackingClassifier
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(sparse=True, n_labels=5, return_indicator='sparse', allow_unlabeled=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
clf = LabelPowerset(xgb.XGBClassifier(n_estimators=500))
clf.fit(X_train, y_train)

res:
LabelPowerset(classifier=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=500,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       require_dense=[True, True])

y_pred = clf.predict(X_test)

print(y_pred)

res:
  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (1, 1)	1
  (1, 2)	1
  (1, 3)	1
  (1, 4)	1
  (2, 0)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	1
  (2, 4)	1
  (3, 1)	1
  (3, 2)	1
  (3, 3)	1
  (3, 4)	1
  (4, 2)	1
  (4, 3)	1
  (4, 4)	1
  (5, 2)	1
  (5, 3)	1
  (5, 4)	1
  (6, 1)	1
  (6, 2)	1
  (6, 3)	1
  (6, 4)	1
  (7, 1)	1
  (7, 2)	1
  (7, 3)	1
  (7, 4)	1
  (8, 2)	1
  (8, 3)	1
  (8, 4)	1
  (9, 1)	1
  (9, 2)	1
  (9, 3)	1
  (9, 4)	1
print("The macro averaged F1-score is: %.3f" %(f1_score(y_pred, y_test, average='macro')))

res:
The macro averaged F1-score is: 0.682

Can you report back if this works for you?

@souravsingh
Author

Hello,

I checked the execution of the program and problem transformation works fine if GaussianNB or other classifiers are used. It just doesn't work for XGBoost.

The dataset is available here for reference: https://goo.gl/ivJDsw

@ChristianSch
Member

I checked your data; the error seems to be related to the one mentioned in dmlc/xgboost#1238:

import scipy.sparse

_X_test = scipy.sparse.csc_matrix(X_test)
_X_train = scipy.sparse.csc_matrix(X_train)
_y_train = scipy.sparse.csc_matrix(y_train)
_y_test = scipy.sparse.csc_matrix(y_test)
clf.fit(_X_train, _y_train)
y_pred = clf.predict(_X_test)
print("The macro averaged F1-score is: %.3f" % (f1_score(y_pred, y_test, average='macro')))
# -> The macro averaged F1-score is: 0.359
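An alternative workaround, assuming the column names themselves are not needed downstream: drop the pandas column names before fitting, so that XGBoost autogenerates matching `f0`, `f1`, … names on both the fit and predict sides. A minimal sketch with a hypothetical stand-in DataFrame (not the real dataset):

```python
import pandas as pd

# Stand-in for the feature DataFrame built in the script above.
X = pd.DataFrame({'Length': [10, 12], 'MW': [1.2, 3.4]})

# Converting to a plain NumPy array up front means both fit and predict
# see unnamed columns, so XGBoost's feature_names check cannot mismatch.
X_arr = X.values
print(X_arr.shape)  # (2, 2)
```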

So from what I can gather, XGBoost is at fault, not scikit-multilearn ;) If this resolves the issue, I'd like to close it afterwards.
