Redundant whitespace in the demo data #233

AndresAlgaba · 2022-07-19T09:38:56Z

Hi everyone! First of all, thanks for all the work on this fantastic library and the Synthetic Data Vault in general :). I believe I found a minor bug in loading the demo data set and propose a quick fix for which I will submit a PR.

Environment Details

CTGAN version: latest (0.5.2.dev1)
Python version: 3.9.7
Operating System: Windows

Error Description

When running the usage example for the CTGANSynthesizer with conditional sampling via the condition_column and condition_value arguments in the sample method:

samples = ctgan.sample(1000, condition_column='native-country', condition_value='United-States')

I get the following warning:
rdt\transformers\categorical.py:374: UserWarning: The data contains 1 new categories that were not seen in the original data (examples: {'United-States'}). Creating a vector of all 0s. If you want to model new categories, please fit the transformer again with the new data.

After looking into it, I found out that the discrete variables contain redundant whitespace in front of the categories. Using ' United-States' (with the redundant whitespace) works fine:

samples = ctgan.sample(1000, condition_column='native-country', condition_value=' United-States')
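A quick way to confirm this kind of problem is to list the string values that differ from their stripped form. The snippet below is only a sketch: the two-row CSV is a made-up excerpt (not the real demo file), and padded_values is a hypothetical helper name, not part of CTGAN or pandas.

```python
import io

import pandas as pd

# Hypothetical excerpt: clean header, but a space after each comma in the rows.
RAW = "age,native-country,income\n39, United-States, <=50K\n50, Cuba, >50K\n"


def padded_values(df):
    """Return {column: [values]} for string values with leading/trailing whitespace."""
    report = {}
    for col in df.columns:
        padded = [
            v for v in df[col].dropna().unique()
            if isinstance(v, str) and v != v.strip()
        ]
        if padded:
            report[col] = padded
    return report


df = pd.read_csv(io.StringIO(RAW))
report = padded_values(df)

# repr() makes the leading space visible when printing.
for col, values in report.items():
    print(col, [repr(v) for v in values])
```

Running the same helper on the DataFrame returned by load_demo() should show whether the discrete columns are affected.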

Solution

I propose setting the skipinitialspace argument to True in the pd.read_csv call inside the load_demo function:

def load_demo():
    """Load the demo."""
    return pd.read_csv(DEMO_URL, compression='gzip', skipinitialspace=True)

This seems to solve the issue.
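To illustrate what skipinitialspace changes, here is a small comparison on an in-memory CSV. The sample rows are invented for the example and only mimic a file with a space after each delimiter; they are not taken from the demo data.

```python
import io

import pandas as pd

# Hypothetical excerpt with a space after each comma in the data rows.
RAW = "age,native-country,income\n39, United-States, <=50K\n50, Cuba, >50K\n"

# Default parsing keeps the space as part of each string value.
default = pd.read_csv(io.StringIO(RAW))

# skipinitialspace=True drops spaces that directly follow the delimiter.
stripped = pd.read_csv(io.StringIO(RAW), skipinitialspace=True)

print(default['native-country'].tolist())
print(stripped['native-country'].tolist())
```

Note that skipinitialspace only affects spaces immediately after the delimiter; trailing whitespace inside a field, if there is any, would be left untouched.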

Steps to reproduce

from ctgan import CTGANSynthesizer
from ctgan import load_demo

data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGANSynthesizer(epochs=1)
ctgan.fit(data, discrete_columns)

# Synthetic copy
samples = ctgan.sample(1000, condition_column='native-country', condition_value='United-States')


npatki · 2022-07-19T15:58:07Z

Hi @AndresAlgaba, nice to meet you and thanks for bringing this to our attention.

The root cause was probably the way this original data was exported. While your suggestion would solve the issue for this particular demo, we'd prefer to fix the format of the underlying data itself. We can apply the same principle to any future demo datasets that may be slightly off a true csv format (in different ways).

I suggest we repurpose this bug for reformatting the original demo data as a proper CSV file. For now, we can suggest that everyone use your manual workaround of reading the CSV with skipinitialspace=True.
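For anyone who already has the data loaded (for example via load_demo()) and would rather not re-read the file, an alternative to the skipinitialspace workaround is to strip the string values in place. This is not something proposed in the thread, just a sketch; strip_string_values is a hypothetical helper, and the small DataFrame stands in for the real demo data.

```python
import pandas as pd


def strip_string_values(df):
    """Return a copy of df with whitespace stripped from every string value."""
    cleaned = df.copy()
    for col in cleaned.columns:
        # Non-string values (numbers, NaN) are passed through unchanged.
        cleaned[col] = cleaned[col].map(
            lambda v: v.strip() if isinstance(v, str) else v
        )
    return cleaned


# Stand-in for the loaded demo data, with padded category values.
demo = pd.DataFrame({
    'age': [39, 50],
    'native-country': [' United-States', ' Cuba'],
})

cleaned = strip_string_values(demo)
print(cleaned['native-country'].tolist())
```

With the real data this would be called as data = strip_string_values(load_demo()) before fitting. Unlike skipinitialspace, str.strip also removes trailing whitespace, so the two approaches are not strictly equivalent.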

AndresAlgaba · 2022-07-20T11:57:38Z

Hi @npatki, nice to meet you too! It's my pleasure; thanks to the team for the effort on SDV and the quick response.

Yes, I agree. Is there anything which I can help with? I have already opened a PR.

npatki · 2022-07-21T14:56:08Z

Unfortunately we don't have public write access to the S3 bucket, which is needed to make this change. We'll add this to our backlog and update the bug when we have a fix.

Thanks for your offer to help!

AndresAlgaba · 2022-07-24T19:10:50Z

Okay, thanks, and no problem!

AndresAlgaba added the bug and pending review labels Jul 19, 2022

This was referenced Jul 19, 2022

Remove redundant whitespace: Issue 233 #234

Open

Conditional sampling and cross-entropy loss #235

Open

npatki added the under discussion label and removed the pending review label Jul 19, 2022

npatki removed the under discussion Issue is currently being discussed label Aug 8, 2022
