Bug: Classification metrics do not support label names containing numbers #1085

Open
florianbodr opened this issue Apr 29, 2024 · 0 comments

In Evidently 0.4.19 with Python 3.10, ClassificationQualityMetric() and ClassificationConfusionMatrix() throw an error when some data labels contain only numeric characters, even if the dataframe column type is specified as string. These are the two metrics I tested, but I suspect other metrics are affected as well.
See sample code below:

from evidently.report import Report
from evidently.metrics import *
import pandas as pd

label_target = ['foo', 'bar', 'fun', 'foo', 'fun', 'foo', '101', '102']
label_predict = ['foo', 'bar', 'fun', 'bar', 'fun', 'fun', '101', '101']
data_df = pd.DataFrame({'target': label_target, 'prediction': label_predict}, dtype="string")

report = Report(metrics=[
    ClassificationQualityMetric(),
    ClassificationConfusionMatrix(),
])
report.run(reference_data=None, current_data=data_df)
report

It ends up with the following error:

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/evidently/calculations/classification_performance.py:316, in calculate_matrix(target, prediction, labels)
    315 def calculate_matrix(target: pd.Series, prediction: pd.Series, labels: List[Union[str, int]]) -> ConfusionMatrix:
--> 316     sorted_labels = sorted(labels)
    317     matrix = metrics.confusion_matrix(target, prediction, labels=sorted_labels)
    318     return ConfusionMatrix(labels=sorted_labels, values=[row.tolist() for row in matrix])

TypeError: '<' not supported between instances of 'str' and 'int'
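
The failure itself can be reproduced without Evidently. The snippet below is only a minimal sketch: it assumes that the numeric-looking labels have become int by the time they reach calculate_matrix, which the traceback suggests but which I have not traced in the Evidently code. The label list is made up for illustration.

```python
# Hypothetical label list: the str/int mix is an assumption based on the traceback,
# not something read out of Evidently's internals.
labels = ['foo', 'bar', 'fun', 101, 102]

try:
    sorted(labels)  # same call as line 316 in the traceback
    error_message = None
except TypeError as exc:
    # Python 3 refuses to order str against int
    error_message = str(exc)

print(error_message)
```

If that assumption holds, any label list mixing str and int will fail at the sorted() call, regardless of the dtype of the input columns.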

Appending a non-numeric character (such as a dot or a dash) to the numeric label names works around the issue:

label_target = ['foo', 'bar', 'fun', 'foo', 'fun', 'foo', '101.', '102-']
label_predict = ['foo', 'bar', 'fun', 'bar', 'fun', 'fun', '101.', '101.']

However, I do not think this is the expected behavior: the dataframe column type should be respected throughout the metric computation.
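
For illustration, one way the sort could tolerate mixed label types is to order labels by their string form. This is only a sketch under the assumption that the labels arrive as a mix of str and int; sort_labels is a hypothetical helper, not an existing Evidently function, and not necessarily how the maintainers would want to fix it.

```python
from typing import List, Union


def sort_labels(labels: List[Union[str, int]]) -> List[Union[str, int]]:
    # Hypothetical helper, not part of Evidently: compare labels by their
    # string representation so that str and int never get compared directly.
    return sorted(labels, key=str)


print(sort_labels(['foo', 'bar', 'fun', 101, 102]))
# [101, 102, 'bar', 'foo', 'fun']
```

This only avoids the TypeError. It does not restore the original string type of the labels, which would still be the better outcome.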

florianbodr changed the title from "Bug: Classification metrics does not support label names containing numbers" to "Bug: Classification metrics do not support label names containing numbers" on Apr 29, 2024