Binary Classification metric fails with unknown category (`ValueError`) #260

josalhor · 2023-11-11T12:12:04Z

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDGym version: sdgym-0.7.0
Python version: 3.9 (probably Any)
Operating System: Linux-like (probably Any)

Error Description

I am trying to run this:

import sdgym

METRICS = [
    ('BinaryDecisionTreeClassifier', {
        'target': 'label',
    }),
    # ... (remaining metrics elided in the original report)
]

rs = sdgym.benchmark_single_table(
    synthesizers=['TVAESynthesizer'],
    show_progress=True,
    sdv_datasets=['adult', 'census', 'intrusion'],
    sdmetrics=METRICS,
)

This produces the following error:

  0%|          | 0/1 [00:00<?, ?it/s]Metric BinaryDecisionTreeClassifier failed on dataset adult. Skipping.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/sdgym/benchmark.py", line 166, in _compute_scores
    score = metric.compute(*metric_args, **metric_kwargs.get(metric_name, {}))
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/single_table/efficacy/base.py", line 127, in compute
    predictions = cls._fit_predict(train_data, train_target, test_data, test_target)
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/single_table/efficacy/binary.py", line 37, in _fit_predict
    return super()._fit_predict(train_data, train_target, test_data, test_target)
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/single_table/efficacy/base.py", line 51, in _fit_predict
    test_data = ht.transform(test_data)
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/utils.py", line 200, in transform
    out = transform_info['one_hot_encoder'].transform(col_data).toarray()
  File "/usr/local/lib/python3.9/dist-packages/sklearn/preprocessing/_encoders.py", line 882, in transform
    X_int, X_mask = self._transform(
  File "/usr/local/lib/python3.9/dist-packages/sklearn/preprocessing/_encoders.py", line 160, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['Never-worked'] in column 0 during transform
Metric BinaryAdaBoostClassifier failed on dataset adult. Skipping.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/sdgym/benchmark.py", line 166, in _compute_scores
    score = metric.compute(*metric_args, **metric_kwargs.get(metric_name, {}))
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/single_table/efficacy/base.py", line 127, in compute
    predictions = cls._fit_predict(train_data, train_target, test_data, test_target)
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/single_table/efficacy/binary.py", line 37, in _fit_predict
    return super()._fit_predict(train_data, train_target, test_data, test_target)
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/single_table/efficacy/base.py", line 51, in _fit_predict
    test_data = ht.transform(test_data)
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/utils.py", line 200, in transform
    out = transform_info['one_hot_encoder'].transform(col_data).toarray()
  File "/usr/local/lib/python3.9/dist-packages/sklearn/preprocessing/_encoders.py", line 882, in transform
    X_int, X_mask = self._transform(
  File "/usr/local/lib/python3.9/dist-packages/sklearn/preprocessing/_encoders.py", line 160, in _transform
    raise ValueError(msg)
...
  0%|          | 0/1 [00:00<?, ?it/s]Metric BinaryDecisionTreeClassifier failed on dataset census. Skipping.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/sdgym/benchmark.py", line 166, in _compute_scores
    score = metric.compute(*metric_args, **metric_kwargs.get(metric_name, {}))
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/single_table/efficacy/base.py", line 127, in compute
    predictions = cls._fit_predict(train_data, train_target, test_data, test_target)
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/single_table/efficacy/binary.py", line 37, in _fit_predict
    return super()._fit_predict(train_data, train_target, test_data, test_target)
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/single_table/efficacy/base.py", line 51, in _fit_predict
    test_data = ht.transform(test_data)
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/utils.py", line 200, in transform
    out = transform_info['one_hot_encoder'].transform(col_data).toarray()
  File "/usr/local/lib/python3.9/dist-packages/sklearn/preprocessing/_encoders.py", line 882, in transform
    X_int, X_mask = self._transform(
  File "/usr/local/lib/python3.9/dist-packages/sklearn/preprocessing/_encoders.py", line 160, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['Job leaver'] in column 0 during transform
...
  0%|          | 0/1 [00:00<?, ?it/s]Metric BinaryDecisionTreeClassifier failed on dataset intrusion. Skipping.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/sdgym/benchmark.py", line 166, in _compute_scores
    score = metric.compute(*metric_args, **metric_kwargs.get(metric_name, {}))
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/single_table/efficacy/base.py", line 127, in compute
    predictions = cls._fit_predict(train_data, train_target, test_data, test_target)
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/single_table/efficacy/binary.py", line 37, in _fit_predict
    return super()._fit_predict(train_data, train_target, test_data, test_target)
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/single_table/efficacy/base.py", line 51, in _fit_predict
    test_data = ht.transform(test_data)
  File "/usr/local/lib/python3.9/dist-packages/sdmetrics/utils.py", line 200, in transform
    out = transform_info['one_hot_encoder'].transform(col_data).toarray()
  File "/usr/local/lib/python3.9/dist-packages/sklearn/preprocessing/_encoders.py", line 882, in transform
    X_int, X_mask = self._transform(
  File "/usr/local/lib/python3.9/dist-packages/sklearn/preprocessing/_encoders.py", line 160, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['supdup', 'ftp', 'mtp', 'gopher', 'hostnames', 'rje', 'whois', 'vmnet', 'systat', 'link', 'iso_tsap', 'exec', 'bgp', 'echo', 'ldap', 'ctf', 'netstat', 'name', 'tim_i', 'courier', 'kshell', 'netbios_ssn', 'uucp', 'remote_job', 'uucp_path', 'urh_i', 'daytime', 'sunrpc', 'red_i', 'klogin', 'login', 'pop_2', 'csnet_ns', 'http_443', 'nnsp', 'tftp_u', 'auth', 'shell', 'Z39_50', 'pm_dump', 'netbios_ns', 'imap4', 'time', 'discard', 'ssh', 'pop_3', 'netbios_dgm', 'domain', 'nntp', 'sql_net'] in column 0 during transform

Steps to reproduce

Just running the above snippet produces the output.

npatki · 2023-11-13T19:32:08Z

Hi @josalhor, thanks for filing this issue with all the details. Our investigation showed that this issue is probably not related to TVAE, as the same error can be replicated with a different synthesizer such as Gaussian Copula.

Root Cause

The BinaryDecisionTreeClassifier metric cannot be run on certain combinations of real/synthetic data.

The metric is designed to take the following steps:

1. Train the ML model using the synthetic data
2. Test the ML model using the real data
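To make the two steps concrete, here is a minimal sketch of the train-on-synthetic, test-on-real pattern using scikit-learn directly. It is not the SDMetrics implementation: the `make_table` helper, the numeric toy data, and the choice of F1 as the score are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def make_table(n_rows):
    """Hypothetical stand-in for a dataset: two numeric features, one binary label."""
    features = rng.normal(size=(n_rows, 2))
    labels = (features[:, 0] + features[:, 1] > 0).astype(int)
    return features, labels


# Pretend one table came from a synthesizer and the other is the real data.
synthetic_X, synthetic_y = make_table(200)
real_X, real_y = make_table(200)

# Step 1: train the model on the synthetic data only.
model = DecisionTreeClassifier(random_state=0).fit(synthetic_X, synthetic_y)

# Step 2: test the model on the real data.
score = f1_score(real_y, model.predict(real_X))
print(f"score on real data: {score:.3f}")
```

Because the toy features are numeric, no category encoding is involved here, so this sketch does not hit the error reported above.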

The problem is that the synthetic data may not cover all the possible categories. For example, assume only 0.1% of the real data had a particular category value such as 'supdup'. It's possible (due to random chance) that none of the synthetic data has this value. In this case, the Binary Classification metric fails with a ValueError because the one-hot encoder, which was fitted on the synthetic data, sees the value for the first time during testing.
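The failure can be reproduced outside SDGym with scikit-learn's `OneHotEncoder`, which is the class raising the error in the traceback. The sketch below uses made-up category values for a single column. The second half shows `handle_unknown='ignore'`, a standard scikit-learn option, as one possible way to tolerate unseen values; whether SDMetrics should adopt it is an open question for the linked feature request, not something this thread settles.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy single-column data: the "synthetic" sample happens to miss 'supdup'.
synthetic = np.array([['http'], ['smtp'], ['http']], dtype=object)
real = np.array([['http'], ['supdup']], dtype=object)

# Default behaviour: unknown categories raise a ValueError at transform time.
strict = OneHotEncoder()
strict.fit(synthetic)
try:
    strict.transform(real)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# With handle_unknown='ignore', an unseen category is encoded as all zeros.
lenient = OneHotEncoder(handle_unknown='ignore')
lenient.fit(synthetic)
encoded = lenient.transform(real).toarray()
print(encoded)
```

With the lenient encoder the row for 'supdup' carries no signal for that column, so the model can still be scored, but it learns nothing about the rare category.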

For more info about the metric, see the API docs.

Next Steps

I'm updating the title of this issue to reflect the findings.

I've also started a new feature request in the underlying SDMetrics library: sdv-dev/SDMetrics#515. We can continue our discussion there.

In the meantime, I wonder if any other metric will be suitable for your purposes? (The Binary Classification metrics are listed as "in Beta" by the SDMetrics docs.)

josalhor · 2023-11-13T19:40:46Z

Your description of the problem makes a lot of sense and matches my findings.

> In the meantime, I wonder if any other metric will be suitable for your purposes? (The Binary Classification metrics are listed as "in Beta" by the SDMetrics docs.)

Actually, I was trying my best to replicate the CTGAN paper results, so I will take a look at the error and try to patch it if possible.

> I've also started a new feature request in the underlying SDMetrics library: sdv-dev/SDMetrics#515. We can continue our discussion there.

I'll write further comments in that issue.

josalhor added the "bug" and "new" labels on Nov 11, 2023

npatki changed the title ~~TVAE unkown category~~ Binary Classification metric fails with unknown category (ValueError) Nov 13, 2023

npatki added the "under discussion" label and removed the "new" label on Nov 13, 2023
