
added dropna to avoid crash on nan values #275

Open · wants to merge 9 commits into master
Conversation

@AlexanderZender commented Jul 14, 2023

Added .copy() to every predict and predict_proba call so that only definitive copies are passed.
Fixed a crash that occurs when categorical features contain NaN values.
Added the option to allow NaN category values in the frontend.

issue: #273

added copy to any predict or predict_proba call to only pass definitive copies
@AlexanderZender (Author)

During the initial opening sequence of the dashboard, a prediction is made using all-NaN values:

[image: screenshot of the dashboard inputs during the initial opening, all values NaN]

This causes an exception in the console, saying that the model's encoder cannot deal with the passed values. That is accurate: Name, for example, is a string feature but is passed -999. The initial exception does not seem to impact the dashboard itself; I do not know whether always doing 'NaN' predictions during the initial opening is intended or not, but I wanted to note it here.
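
A hypothetical repro of the exception being described (not code from this repo): an encoder fitted only on string names receives the -999 sentinel and raises, because -999 was never seen during fit.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Encoder fitted on string names only, mimicking the model's preprocessing.
enc = OneHotEncoder()  # default handle_unknown="error"
enc.fit(pd.DataFrame({"Name": ["Alice", "Bob"]}))

try:
    # The dashboard's -999 sentinel was never seen during fit, so this raises.
    enc.transform(pd.DataFrame({"Name": [-999]}))
except ValueError as err:
    print(err)  # e.g. "Found unknown categories [-999] in column 0 ..."
```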

@AlexanderZender (Author)

I added a 'NaN' category value to the categorical_dict whenever a categorical feature contains NaN values.
Without this, the model's encoder breaks, because -999 is always passed instead of 'NaN', which is a known/allowed value.
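
A minimal sketch of the idea, not the exact PR diff; the helper name `build_categorical_dict` is hypothetical, while `categorical_dict` and the 'NaN' sentinel string come from the description above:

```python
import pandas as pd

def build_categorical_dict(X: pd.DataFrame, cat_cols: list) -> dict:
    """Hypothetical helper: collect the allowed categories per feature,
    appending 'NaN' when the column actually contains missing values."""
    categorical_dict = {}
    for col in cat_cols:
        categories = X[col].dropna().unique().tolist()
        if X[col].isna().any():
            categories.append("NaN")  # lets the frontend offer NaN as a selectable value
        categorical_dict[col] = categories
    return categorical_dict
```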

@oegedijk (Owner) commented Aug 1, 2023

I'm not sure why we are adding all those .copy() calls to the prediction calls? Prediction shouldn't have any side effects on X, so why is this needed?

@oegedijk (Owner) commented Aug 1, 2023

Could you add some tests that show this works?

@AlexanderZender (Author) commented Aug 1, 2023

The copy is technically not needed, but it avoids errors, for example when you pass a pipeline that manipulates the incoming DataFrame as part of its processing. Those changes are indirectly applied to the DataFrame held by the explainer dashboard, so the next time the dashboard calls predict_proba, it uses the manipulated DataFrame. That causes a mismatch inside the model's pipeline, because the DataFrame no longer matches what the pipeline expects.

You could leave that for the user to manage, but it may not always be possible to perform a copy inside the pipeline or one's own code. (A sketch of the failure mode follows.)
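
To illustrate, here is a hypothetical transformer (not code from this repo) that fills values in place; without a defensive .copy(), it silently rewrites the caller's stored DataFrame.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class InPlaceImputer(BaseEstimator, TransformerMixin):
    """Hypothetical (badly behaved) step that mutates its input in place."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X.fillna(-999, inplace=True)  # side effect on the caller's DataFrame
        return X

X = pd.DataFrame({"Age": [22.0, None]})
InPlaceImputer().fit_transform(X)           # X is now silently modified
print(X)                                    # Age: [22.0, -999.0]

X2 = pd.DataFrame({"Age": [22.0, None]})
InPlaceImputer().fit_transform(X2.copy())   # defensive copy: X2 keeps its NaN
print(X2)                                   # Age: [22.0, NaN]
```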

Edit: since your comment asked about the predict function, I was unsure whether this problem also occurs there.
I only observed predict_proba causing issues, so the copies may be unnecessary for predict; I can remove them.

removed copy from predict function calls,
added test for testing categorical labels
@AlexanderZender (Author) commented Aug 1, 2023

@oegedijk I added the requested test and removed .copy() from all predict function calls.
I also added a test for categorical labels; this will fail, as explainerdashboard does not currently support these.

I'm not completely sure how many changes are needed to allow categorical labels, but the test is already there, together with a dataset to test with (an adjusted subset of the car dataset from OpenML). Maybe this is something to look into, as there are numerous datasets with categorical labels, and as far as I can see explainerdashboard already converts numerical labels to strings (?).
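
For context, a minimal sketch of how string class names can be attached today: numeric 0/1 targets plus display names via ClassifierExplainer's labels argument (the labels parameter is documented in explainerdashboard; the model choice here is illustrative).

```python
from sklearn.ensemble import RandomForestClassifier
from explainerdashboard import ClassifierExplainer
from explainerdashboard.datasets import titanic_survive

X_train, y_train, X_test, y_test = titanic_survive()
model = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)

# y stays numeric (0/1); only the display names are strings.
explainer = ClassifierExplainer(model, X_test, y_test,
                                labels=["Not survived", "Survived"])
```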

@oegedijk (Owner) commented Aug 2, 2023

> I added the requested test and removed .copy() from all predict function calls.

I guess all these predict functions only predict a single row, so maybe the cost of adding these copies is not so bad? An alternative would be to do them only for pipelines, but that would also introduce additional overhead.

@oegedijk (Owner) commented Aug 2, 2023

> I added a test for categorical labels; this will fail, as explainerdashboard does not currently support these.

I think it should be possible to reuse the Titanic test set? Just replace the 0/1 labels with 'survived'/'not survived'. For testing the NaNs, we could just randomly sprinkle some NaNs in there.

What would be the fastest/cheapest model that works well with missing values, to minimize test time? (HistGradientBoostingClassifier comes to mind, but maybe there are faster options.)
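
A quick sanity check on that idea (synthetic data; the small max_iter is just to keep a test fast): HistGradientBoostingClassifier accepts NaNs natively, so no imputation step is needed.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan   # randomly sprinkle ~10% NaNs in
y = rng.integers(0, 2, size=200)

clf = HistGradientBoostingClassifier(max_iter=20)  # small, for test speed
clf.fit(X, y)                                       # NaNs handled natively
print(clf.predict_proba(X[:3]))
```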

@AlexanderZender (Author)

> I added the requested test and removed .copy() from all predict function calls.

> I guess all these predict functions only predict a single row, so maybe the cost of adding these copies is not so bad? An alternative would be to do them only for pipelines, but that would also introduce additional overhead.

I think the overhead depends on the complexity of the dataset being used.
When does explainerdashboard use the predict function?

@AlexanderZender (Author)

> I added a test for categorical labels; this will fail, as explainerdashboard does not currently support these.

> I think it should be possible to reuse the Titanic test set? Just replace the 0/1 labels with 'survived'/'not survived'. For testing the NaNs, we could just randomly sprinkle some NaNs in there.
>
> What would be the fastest/cheapest model that works well with missing values, to minimize test time? (HistGradientBoostingClassifier comes to mind, but maybe there are faster options.)

That would be possible, provided it does not cause issues with other tests.
As I'm not familiar with the entire test suite, I'd rather avoid changing it, but I can adapt my tests.

As for the training time, I just used a RandomForestClassifier, which trains in less than a second on my system. In my "pipeline", NaN values are never passed to the model, as the one-hot encoder absorbs them. The test checks that NaNs in the original dataset do not break the dashboard; what the pipeline does with the NaNs is irrelevant. (A sketch of such a pipeline follows.)
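
Something like the following (a sketch, not the PR's test code; column names are illustrative, and it assumes a scikit-learn version where OneHotEncoder accepts NaN, i.e. >= 0.24): the encoder absorbs NaNs before the model ever sees them, so a plain RandomForestClassifier works even though it cannot handle NaNs itself.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"Deck": ["A", np.nan, "B", "A"],    # categorical with NaNs
                  "Age": [22.0, 35.0, 58.0, 29.0]})
y = [0, 1, 0, 1]

pipe = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Deck"])],
        remainder="passthrough")),
    ("model", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipe.fit(X, y)                 # NaN is absorbed by the encoder
print(pipe.predict_proba(X))   # the forest itself never sees a NaN
```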

@AlexanderZender (Author)

I updated the test to use the available Titanic data.
I perform the required manipulations inside the test, e.g. adding NaNs to Name or label-encoding the target.
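
Roughly the shape of that setup (a sketch, not the exact test code; the NaN fraction and column choice are illustrative, and it assumes the bundled loader returns pandas objects):

```python
import numpy as np
from explainerdashboard.datasets import titanic_survive

X_train, y_train, X_test, y_test = titanic_survive()

# Sprinkle NaNs into one feature column inside the test itself.
col = X_test.columns[0]                        # illustrative column choice
nan_idx = X_test.sample(frac=0.1, random_state=0).index
X_test.loc[nan_idx, col] = np.nan

# Turn the 0/1 target into string labels for the categorical-label test.
y_test = y_test.map({0: "not survived", 1: "survived"})
```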

@oegedijk (Owner) commented Aug 3, 2023

> As for the training time, I just used a RandomForestClassifier, which trains in less than a second on my system. In my "pipeline", NaN values are never passed to the model, as the one-hot encoder absorbs them. The test checks that NaNs in the original dataset do not break the dashboard; what the pipeline does with the NaNs is irrelevant.

Ah, apologies, I didn't fully understand then. I thought you were thinking of algorithms that can deal with missing values themselves (such as HistGradientBoostingClassifier), but you mean pipelines that fill in NaNs. Both might be something we should be able to support.

@oegedijk (Owner) commented Aug 3, 2023

Will have a closer look at the code, hopefully tomorrow or this weekend. But thanks already for this contribution; let's try to get it ready and released quickly!

@AlexanderZender (Author)

> Will have a closer look at the code, hopefully tomorrow or this weekend. But thanks already for this contribution; let's try to get it ready and released quickly!

Sounds good, tell me if you find any issues with it.
Will you have time to take a look at the categorical labels? I think it would be great to merge these together.

```diff
@@ -572,7 +572,7 @@ def one_vs_all_metric(metric, pos_label, y_true, y_pred):
     sign = 1 if greater_is_better else -1

     def _scorer(clf, X, y):
-        y_pred = clf.predict_proba(X)
+        y_pred = clf.predict_proba(X.copy())
```
@oegedijk (Owner)

I don't think this is needed, as we already have the .copy in line 654.

@AlexanderZender (Author)

You are correct, I removed it.

@AlexanderZender (Author)

@oegedijk Totally forgot that there was a conflict, oops.

@AlexanderZender (Author)

@oegedijk Any info on when this will be looked at?
