Bug with dabl.explain() #258

Open
Luerken opened this issue Sep 23, 2020 · 3 comments
@Luerken

Luerken commented Sep 23, 2020

I've tried dabl with my own data, following the "quickstart guide".

In particular I have

X_train = train_df_clean
y_train = train_df.is_match
simple_clf.fit(X_train, y_train)

which gave the following results:

Running DummyClassifier(strategy='prior')
accuracy: 0.993 average_precision: 0.007 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.498
=== new best DummyClassifier(strategy='prior') (using recall_macro):
accuracy: 0.993 average_precision: 0.007 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.498

Running GaussianNB()
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000
=== new best GaussianNB() (using recall_macro):
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000

Running MultinomialNB()
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000
Running DecisionTreeClassifier(class_weight='balanced', max_depth=1)
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000
Running DecisionTreeClassifier(class_weight='balanced', max_depth=5)
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000
Running DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01)
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000
Running LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000)
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 0.997
Running LogisticRegression(class_weight='balanced', max_iter=1000)
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000

Best model:
GaussianNB()
Best Scores:
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000
SimpleClassifier(random_state=1234, refit=True, shuffle=True, type_hints=None,
                 verbose=1)
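Side note on the scores above (an illustrative sketch, not from the actual data): `DummyClassifier(strategy='prior')` always predicts the majority class, so its accuracy equals the majority-class frequency. An accuracy of 0.993 therefore suggests only roughly 0.7% of rows are positive, which is worth keeping in mind when every later model scores a perfect 1.000.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with the same imbalance as the log above: 7 positives in 1000.
y = np.array([0] * 993 + [1] * 7)
X = np.zeros((1000, 1))  # features are irrelevant to the "prior" strategy

# The "prior" dummy predicts the most frequent class for every sample,
# so its accuracy is exactly the majority-class frequency.
clf = DummyClassifier(strategy="prior").fit(X, y)
print(clf.score(X, y))  # 0.993
```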

This is able to make predictions without issue, but when I ran:
dabl.explain(simple_clf)
I got the following traceback:


TypeErrorTraceback (most recent call last)
/usr/local/lib/python3.7/site-packages/dabl/explain.py in _extract_inner_estimator(estimator, feature_names)
    235             feature_names = inner_estimator.steps[0][1].get_feature_names(
--> 236                 feature_names)
    237         except TypeError:

TypeError: get_feature_names() takes 1 positional argument but 2 were given

During handling of the above exception, another exception occurred:

ValueErrorTraceback (most recent call last)
<ipython-input-36-37d2322eefb7> in <module>
----> 1 dabl.explain(simple_clf)

/usr/local/lib/python3.7/site-packages/dabl/explain.py in explain(estimator, X_val, y_val, target_col, feature_names, n_top_features)
    123 
    124     inner_estimator, inner_feature_names = _extract_inner_estimator(
--> 125         estimator, feature_names)
    126 
    127     if X_val is not None:

/usr/local/lib/python3.7/site-packages/dabl/explain.py in _extract_inner_estimator(estimator, feature_names)
    236                 feature_names)
    237         except TypeError:
--> 238             feature_names = inner_estimator.steps[0][1].get_feature_names()
    239 
    240         # now we have input feature names for the final step

/usr/local/lib/python3.7/site-packages/dabl/preprocessing.py in get_feature_names(self)
    607                 # FIXME that is really strange?!
    608                 ohe_cols = self.columns_[self.columns_.map(cols)]
--> 609                 feature_names.extend(ohe.get_feature_names(ohe_cols))
    610             elif name == "remainder":
    611                 assert trans == "drop"

/usr/local/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in get_feature_names(self, input_features)
    530                 "input_features should have length equal to number of "
    531                 "features ({}), got {}".format(len(self.categories_),
--> 532                                                len(input_features)))
    533 
    534         feature_names = []

ValueError: input_features should have length equal to number of features (16), got 14

My train_df_clean has the following structure:

continuous dirty_float low_card_int categorical date free_string useless
id False False False False False True False
wt_ratio True False False False False False False
cat_a False False False False False True False
cat_b False False False False False True False
cat_c False False False False False True False
cat_d False False False False False True False
cat_e False False False True False False False
cat_f False False False True False False False
cat_g False False False True False False False
alt_id False False False False False True False
type_1 False False False False False True False
type_2 False False False False False True False
class False False False True False False False
subclass_1 False False False True False False False
subclass_2 False False False False False True False
in_cap False False False False False True False
tax_1 False False False True False False False
tax_2 False False False True False False False
col False False False False False True False
cap_loc False False False True False False False
grp_1 False False False True False False False
grp_2 False False False False False True False
grp_3 False False False False False True False
temp False False False False False True False
notes_1 False False False False False True False
notes_2 False False False False False True False
meas_1 True False False False False False False
meas_2 True False False False False False False
meas_3 True False False False False False False
meas_4 False False False True False False False
len True False False False False False False
loc_area1 False False False True False False False
loc_area2 False False False True False False False
loc_area3 False False False True False False False
loc_area4 False False False True False False False

I can see that I have 14 categorical columns, which seems to align with the count in the error, but I would have expected GaussianNB to also use the continuous fields.
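For anyone cross-checking those counts, a quick pandas sketch (using a made-up fragment of the type table above, in the boolean-flag shape that dabl's type detection reports) shows how to tally the flagged columns:

```python
import pandas as pd

# Hypothetical fragment of the detected-type table above:
# one row per input column, one boolean flag per detected type.
types = pd.DataFrame(
    {
        "continuous": [True, False, False, False],
        "categorical": [False, True, True, False],
        "free_string": [False, False, False, True],
    },
    index=["wt_ratio", "cat_e", "cat_f", "notes_1"],
)

# Count and list the columns flagged as categorical.
n_categorical = int(types["categorical"].sum())
print(n_categorical)                               # 2 in this fragment
print(types.index[types["categorical"]].tolist())  # ['cat_e', 'cat_f']
```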

Thank you in advance!

@Luerken
Author

Luerken commented Sep 24, 2020

I have done more manual digging and determined that the issue appears to lie with a categorical column being treated as a continuous variable. When I pulled other features out, a decision tree was the best model and dabl.explain worked without issue; however, once the naive Bayes model won out, the function broke. I also saw some odd behavior when NaN values existed in the columns.

@amueller
Collaborator

Thanks for reporting! Indeed the explain function is not very robust yet.
Scikit-learn makes mapping input to output columns a bit hard, which will hopefully be improved by scikit-learn/scikit-learn#16772

I'll see what I can do in the meantime; dabl also needs some updates for the current version of sklearn, which I'll probably try to make work first.

@Luerken
Author

Luerken commented Sep 28, 2020

Awesome, thank you! I was planning on looking through the code to try to understand it better as well. Thank you for your work on this; it's a really cool library.
