sklearn throws ValueError exception #333

Open

yoid2000 opened this issue Mar 15, 2023 · 2 comments

Problem Description

I am working with a home-grown synthesizer that is able to synthesize relatively rare categorical values (e.g. a value that occurs only 3 or 4 times in a table of thousands of rows).

This is all fine and good, but when I run a model (say, `sdmetrics.single_table.LinearRegression.compute()`) on the synthetic data, it can occasionally happen that no instances of the rare value show up in the test data (randomly sampled from the original data), while some instances do show up in the training data (randomly sampled from the synthesized data).

This in turn causes the ML Efficacy measures to fail with a message like this:

```
ValueError: Found unknown categories ['fake'] in column 0 during transform
```
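
For illustration, the same sklearn error can be reproduced directly (a minimal sketch with made-up data, not the actual SDMetrics code path):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

test = pd.DataFrame({'cat': ['a', 'b']})            # rare value absent
train = pd.DataFrame({'cat': ['a', 'b', 'fake']})   # rare value present

enc = OneHotEncoder()  # handle_unknown defaults to 'error'
enc.fit(test)          # encoder only learns categories 'a' and 'b'
enc.transform(train)   # raises: Found unknown categories ['fake'] in column 0
```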

This can be avoided by setting `handle_unknown='ignore'` on the sklearn encoders (i.e. `enc = OneHotEncoder(handle_unknown='ignore')` inside `HyperTransformer.fit()`).
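
Continuing the sketch above, `handle_unknown='ignore'` makes the encoder emit all-zero rows for unseen categories instead of raising:

```python
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(test)
enc.transform(train).toarray()  # no error; the 'fake' row encodes as [0., 0.]
```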

Unfortunately, there is no way to set the handle_unknown parameter from sdmetrics, so there is no way for me to complete these measures (short of hard-coding the parameter in sklearn itself). I could probably put a try-except around the efficacy measure, but that still doesn't allow the measure itself to complete.

Expected behavior

Allow the handle_unknown flag to be specified in the `model.compute()` calls, either explicitly or through some kind of parameter pass-through to sklearn.
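
For example, something along these lines (a hypothetical sketch; the `encoder_kwargs` parameter does not exist in SDMetrics today, and the exact signature may differ):

```python
from sdmetrics.single_table import LinearRegression

score = LinearRegression.compute(
    test_data,
    train_data,
    target='target_column',
    encoder_kwargs={'handle_unknown': 'ignore'},  # hypothetical pass-through to sklearn
)
```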

yoid2000 added labels: feature request, new (Mar 15, 2023)
npatki transferred this issue from sdv-dev/SDV (Apr 6, 2023)
npatki (Contributor) commented Apr 6, 2023

Hi @yoid2000, I transferred this issue into SDMetrics as this is the underlying library that implements the metric.

I can replicate this error and will classify this as a bug.

The expectation is that the training data does contain all possible values, since this is crucial information for forming the Linear Regression model. I agree that it should be ok if the test data does not contain all possible category values.

Root Cause

This error seems to be related to #291: the preprocessing step is fitting its transformers on the wrong dataset.

Observed: The code fits the transformers on the test_data and then applies them to the train_data. That is why it expects every category to appear in the test data.

Expected: The code should fit on the train_data and then apply the result to the test_data. All categories should be present during training, but it does not matter if some are missing from the test data.
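
A minimal sketch of that expected order, using sklearn's OneHotEncoder as an illustrative stand-in for the internal HyperTransformer:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_data = pd.DataFrame({'cat': ['a', 'b', 'fake']})  # all categories present
test_data = pd.DataFrame({'cat': ['a', 'b']})

enc = OneHotEncoder()
enc.fit(train_data)                    # fit where every category appears...
train_enc = enc.transform(train_data)
test_enc = enc.transform(test_data)    # ...so the test split has no unknowns
```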

npatki added labels: bug, under discussion; removed labels: feature request, new (Apr 6, 2023)
iamamiramine commented:

Any updates? I am facing the same issue. Any workaround?
