Bug with dabl.explain() #258

Open
Luerken opened this issue Sep 23, 2020 · 3 comments
@Luerken

Luerken commented Sep 23, 2020

I've tried dabl with my own data, following the "quickstart guide".

In particular I have

X_train = train_df_clean
y_train = train_df.is_match
simple_clf.fit(X_train, y_train)

which gave the following results:

Running DummyClassifier(strategy='prior')
accuracy: 0.993 average_precision: 0.007 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.498
=== new best DummyClassifier(strategy='prior') (using recall_macro):
accuracy: 0.993 average_precision: 0.007 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.498

Running GaussianNB()
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000
=== new best GaussianNB() (using recall_macro):
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000

Running MultinomialNB()
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000
Running DecisionTreeClassifier(class_weight='balanced', max_depth=1)
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000
Running DecisionTreeClassifier(class_weight='balanced', max_depth=5)
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000
Running DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01)
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000
Running LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000)
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 0.997
Running LogisticRegression(class_weight='balanced', max_iter=1000)
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000

Best model:
GaussianNB()
Best Scores:
accuracy: 1.000 average_precision: 1.000 roc_auc: 1.000 recall_macro: 1.000 f1_macro: 1.000
SimpleClassifier(random_state=1234, refit=True, shuffle=True, type_hints=None,
                 verbose=1)
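Side note on the scores above (an illustrative sketch, not from the actual data): `DummyClassifier(strategy='prior')` always predicts the majority class, so its accuracy equals the majority-class frequency. An accuracy of 0.993 therefore suggests only roughly 0.7% of rows are positive, which is worth keeping in mind when every later model scores a perfect 1.000.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with the same imbalance as the log above: 7 positives in 1000.
y = np.array([0] * 993 + [1] * 7)
X = np.zeros((1000, 1))  # features are irrelevant to the "prior" strategy

# The "prior" dummy predicts the most frequent class for every sample,
# so its accuracy is exactly the majority-class frequency.
clf = DummyClassifier(strategy="prior").fit(X, y)
print(clf.score(X, y))  # 0.993
```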

This is able to make predictions without issue, but when I ran:
dabl.explain(simple_clf)
I got the following traceback:


TypeErrorTraceback (most recent call last)
/usr/local/lib/python3.7/site-packages/dabl/explain.py in _extract_inner_estimator(estimator, feature_names)
    235             feature_names = inner_estimator.steps[0][1].get_feature_names(
--> 236                 feature_names)
    237         except TypeError:

TypeError: get_feature_names() takes 1 positional argument but 2 were given

During handling of the above exception, another exception occurred:

ValueErrorTraceback (most recent call last)
<ipython-input-36-37d2322eefb7> in <module>
----> 1 dabl.explain(simple_clf)

/usr/local/lib/python3.7/site-packages/dabl/explain.py in explain(estimator, X_val, y_val, target_col, feature_names, n_top_features)
    123 
    124     inner_estimator, inner_feature_names = _extract_inner_estimator(
--> 125         estimator, feature_names)
    126 
    127     if X_val is not None:

/usr/local/lib/python3.7/site-packages/dabl/explain.py in _extract_inner_estimator(estimator, feature_names)
    236                 feature_names)
    237         except TypeError:
--> 238             feature_names = inner_estimator.steps[0][1].get_feature_names()
    239 
    240         # now we have input feature names for the final step

/usr/local/lib/python3.7/site-packages/dabl/preprocessing.py in get_feature_names(self)
    607                 # FIXME that is really strange?!
    608                 ohe_cols = self.columns_[self.columns_.map(cols)]
--> 609                 feature_names.extend(ohe.get_feature_names(ohe_cols))
    610             elif name == "remainder":
    611                 assert trans == "drop"

/usr/local/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in get_feature_names(self, input_features)
    530                 "input_features should have length equal to number of "
    531                 "features ({}), got {}".format(len(self.categories_),
--> 532                                                len(input_features)))
    533 
    534         feature_names = []

ValueError: input_features should have length equal to number of features (16), got 14

My train_df_clean has the following structure:

continuous dirty_float low_card_int categorical date free_string useless
id False False False False False True False
wt_ratio True False False False False False False
cat_a False False False False False True False
cat_b False False False False False True False
cat_c False False False False False True False
cat_d False False False False False True False
cat_e False False False True False False False
cat_f False False False True False False False
cat_g False False False True False False False
alt_id False False False False False True False
type_1 False False False False False True False
type_2 False False False False False True False
class False False False True False False False
subclass_1 False False False True False False False
subclass_2 False False False False False True False
in_cap False False False False False True False
tax_1 False False False True False False False
tax_2 False False False True False False False
col False False False False False True False
cap_loc False False False True False False False
grp_1 False False False True False False False
grp_2 False False False False False True False
grp_3 False False False False False True False
temp False False False False False True False
notes_1 False False False False False True False
notes_2 False False False False False True False
meas_1 True False False False False False False
meas_2 True False False False False False False
meas_3 True False False False False False False
meas_4 False False False True False False False
len True False False False False False False
loc_area1 False False False True False False False
loc_area2 False False False True False False False
loc_area3 False False False True False False False
loc_area4 False False False True False False False

I can see that I have 14 categorical columns, which seems to align with the count in the error, but I would have expected GaussianNB to also use the continuous fields.
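For anyone cross-checking those counts, a quick pandas sketch (using a made-up fragment of the type table above, in the boolean-flag shape that dabl's type detection reports) shows how to tally the flagged columns:

```python
import pandas as pd

# Hypothetical fragment of the detected-type table above:
# one row per input column, one boolean flag per detected type.
types = pd.DataFrame(
    {
        "continuous": [True, False, False, False],
        "categorical": [False, True, True, False],
        "free_string": [False, False, False, True],
    },
    index=["wt_ratio", "cat_e", "cat_f", "notes_1"],
)

# Count and list the columns flagged as categorical.
n_categorical = int(types["categorical"].sum())
print(n_categorical)                               # 2 in this fragment
print(types.index[types["categorical"]].tolist())  # ['cat_e', 'cat_f']
```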

Thank you in advance!

@Luerken
Author

Luerken commented Sep 24, 2020

I have done more manual digging and determined that the issue appears to lie with a categorical column being treated as a continuous variable. When I pulled other features out, a decision tree was the best model and dabl.explain worked without issue; however, once the naive Bayes model won out, the function broke. I also saw some odd behavior when NaN values existed in the columns.

@amueller
Collaborator

Thanks for reporting! Indeed the explain function is not very robust yet.
Scikit-learn makes mapping input to output columns a bit hard, which will hopefully be improved by scikit-learn/scikit-learn#16772

I'll see what I can do in the meantime; dabl also needs some updates for the current version of sklearn, which I'll probably try to make work first.

@Luerken
Author

Luerken commented Sep 28, 2020

Awesome, thank you! I was planning on looking through the code to try to understand it better as well. Thank you for your work on this; it's a really cool library.
