sklearn throws ValueError exception #333

Open

yoid2000 opened this issue Mar 15, 2023 · 2 comments

Problem Description

I am working with a home-grown synthesizer that is able to synthesize relatively rare categorical values (e.g. a value that occurs only 3 or 4 times in a table of thousands of rows).

This is all fine and good, but when I run a model (say, `sdmetrics.single_table.LinearRegression.compute()`) on the synthetic data, it can occasionally happen that no instances of the rare value show up in the test data (randomly sampled from the original data), while some instances do show up in the training data (randomly sampled from the synthesized data).

This in turn causes the ML Efficacy measures to fail with a message like this:

```
ValueError: Found unknown categories ['fake'] in column 0 during transform
```
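
For illustration, the same sklearn error can be reproduced directly (a minimal sketch with made-up data, not the actual SDMetrics code path):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

test = pd.DataFrame({'cat': ['a', 'b']})            # rare value absent
train = pd.DataFrame({'cat': ['a', 'b', 'fake']})   # rare value present

enc = OneHotEncoder()  # handle_unknown defaults to 'error'
enc.fit(test)          # encoder only learns categories 'a' and 'b'
enc.transform(train)   # raises: Found unknown categories ['fake'] in column 0
```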

This can be avoided by setting `handle_unknown='ignore'` on the sklearn encoders (i.e. `enc = OneHotEncoder(handle_unknown='ignore')` inside `HyperTransformer.fit()`).
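
Continuing the sketch above, `handle_unknown='ignore'` makes the encoder emit all-zero rows for unseen categories instead of raising:

```python
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(test)
enc.transform(train).toarray()  # no error; the 'fake' row encodes as [0., 0.]
```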

Unfortunately, there is no way to set the handle_unknown parameter from sdmetrics, so there is no way for me to complete these measures (short of hard-coding the parameter in sklearn itself). I could probably put a try-except around the efficacy measure, but that still doesn't allow the measure itself to complete.

Expected behavior

Allow the handle_unknown flag to be specified in the `model.compute()` calls, either explicitly or through some kind of parameter pass-through to sklearn.
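
For example, something along these lines (a hypothetical sketch; the `encoder_kwargs` parameter does not exist in SDMetrics today, and the exact signature may differ):

```python
from sdmetrics.single_table import LinearRegression

score = LinearRegression.compute(
    test_data,
    train_data,
    target='target_column',
    encoder_kwargs={'handle_unknown': 'ignore'},  # hypothetical pass-through to sklearn
)
```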

yoid2000 added labels: feature request, new (Mar 15, 2023)
npatki transferred this issue from sdv-dev/SDV (Apr 6, 2023)
npatki (Contributor) commented Apr 6, 2023

Hi @yoid2000, I transferred this issue into SDMetrics as this is the underlying library that implements the metric.

I can replicate this error and will classify this as a bug.

The expectation is that the training data does contain all possible values, since this is crucial information for forming the Linear Regression model. I agree that it should be ok if the test data does not contain all possible category values.

Root Cause

This error seems to be related to #291: the preprocessing step is fitting its transformers on the wrong dataset.

Observed: The code fits the transformers on the test_data and then applies them to the train_data. That is why it expects every category to appear in the test data.

Expected: The code should fit on the train_data and then apply the result to the test_data. All categories should be present during training, but it does not matter if some are missing from the test data.
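
A minimal sketch of that expected order, using sklearn's OneHotEncoder as an illustrative stand-in for the internal HyperTransformer:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_data = pd.DataFrame({'cat': ['a', 'b', 'fake']})  # all categories present
test_data = pd.DataFrame({'cat': ['a', 'b']})

enc = OneHotEncoder()
enc.fit(train_data)                    # fit where every category appears...
train_enc = enc.transform(train_data)
test_enc = enc.transform(test_data)    # ...so the test split has no unknowns
```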

npatki added labels: bug, under discussion; removed labels: feature request, new (Apr 6, 2023)
iamamiramine commented:

Any updates? I am facing the same issue. Any workaround?
