Redundant whitespace in the demo data #233

Open
AndresAlgaba opened this issue Jul 19, 2022 · 4 comments · May be fixed by #234
Labels
bug Something isn't working

Comments

@AndresAlgaba

Hi everyone! First of all, thanks for all the work on this fantastic library and the Synthetic Data Vault in general :). I believe I found a minor bug in loading the demo data set and propose a quick fix for which I will submit a PR.

Environment Details

  • CTGAN version: latest (0.5.2.dev1)
  • Python version: 3.9.7
  • Operating System: Windows

Error Description

When running the usage example for the CTGANSynthesizer with conditional sampling via the condition_column and condition_value arguments in the sample method:

samples = ctgan.sample(1000, condition_column='native-country', condition_value='United-States')

I get the following warning:
rdt\transformers\categorical.py:374: UserWarning: The data contains 1 new categories that were not seen in the original data (examples: {'United-States'}). Creating a vector of all 0s. If you want to model new categories, please fit the transformer again with the new data.

After looking into it, I found that the category values of the discrete columns carry a redundant leading space. Using ' United-States' (with the leading space) works fine:

samples = ctgan.sample(1000, condition_column='native-country', condition_value=' United-States')
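
A minimal sketch of how the stray whitespace can be spotted. It uses a small in-memory CSV instead of the real demo file, so the sample rows and the helper name `columns_with_leading_space` are assumptions made for illustration, not part of CTGAN:

```python
import io

import pandas as pd

# Hypothetical two-row sample imitating values written as ", value" in the file.
RAW = "age,native-country,income\n39, United-States, <=50K\n50, Cuba, >50K\n"


def columns_with_leading_space(df):
    """Return the names of non-numeric columns holding values that start with a space."""
    flagged = []
    for name in df.columns:
        column = df[name]
        if pd.api.types.is_numeric_dtype(column):
            continue  # numbers are parsed without the space, nothing to check
        values = column.dropna().astype(str)
        if values.str.startswith(" ").any():
            flagged.append(name)
    return flagged


df = pd.read_csv(io.StringIO(RAW))
print(df["native-country"].tolist())   # [' United-States', ' Cuba']
print(columns_with_leading_space(df))  # ['native-country', 'income']
```

Running the same helper on the frame returned by `load_demo()` should list the affected discrete columns, assuming the demo file has the layout described above.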

Solution

I propose setting the skipinitialspace argument to True in the pd.read_csv call inside the load_demo function:

def load_demo():
    """Load the demo."""
    return pd.read_csv(DEMO_URL, compression='gzip', skipinitialspace=True)

This seems to solve the issue.
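
To illustrate what `skipinitialspace=True` changes, here is a small sketch that parses the same text twice. The in-memory sample is an assumption standing in for the demo file; only `pd.read_csv` and its `skipinitialspace` parameter come from the proposal above:

```python
import io

import pandas as pd

# Hypothetical sample: every value after a comma is preceded by one space.
RAW = "age,native-country,income\n39, United-States, <=50K\n50, Cuba, >50K\n"

# Default parsing keeps the space that follows each delimiter.
default = pd.read_csv(io.StringIO(RAW))

# skipinitialspace=True drops spaces that directly follow the delimiter.
cleaned = pd.read_csv(io.StringIO(RAW), skipinitialspace=True)

print(default["native-country"].tolist())  # [' United-States', ' Cuba']
print(cleaned["native-country"].tolist())  # ['United-States', 'Cuba']
```

With the cleaned values, a condition such as `condition_value='United-States'` refers to a category that the fitted transformer has actually seen.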

Steps to reproduce

from ctgan import CTGANSynthesizer
from ctgan import load_demo

data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGANSynthesizer(epochs=1)
ctgan.fit(data, discrete_columns)

# Synthetic copy
samples = ctgan.sample(1000, condition_column='native-country', condition_value='United-States')

@AndresAlgaba added the bug and pending review labels on Jul 19, 2022
@npatki
Contributor

npatki commented Jul 19, 2022

Hi @AndresAlgaba, nice to meet you and thanks for bringing this to our attention.

The root cause is probably the way the original data was exported. While your suggestion would solve the issue for this particular demo, we'd prefer to fix the format of the underlying data itself. We can then apply the same principle to any future demo datasets that deviate from a proper CSV format in other ways.

I suggest we repurpose this bug for reformatting the original demo data as a proper CSV file. For now, we can suggest that everyone use your manual workaround of reading the CSV with skipinitialspace=True.
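
For data that has already been loaded, an alternative to re-reading the file is to strip the values in place. The sketch below is one possible way to do that, not something the maintainers proposed. The helper name `strip_string_values` and the sample frame are hypothetical; in practice the input would be the frame returned by `load_demo()`:

```python
import pandas as pd


def strip_string_values(df):
    """Return a copy of df with surrounding whitespace removed from string values."""
    cleaned = df.copy()
    for name in cleaned.columns:
        if pd.api.types.is_numeric_dtype(cleaned[name]):
            continue  # leave numeric columns untouched
        # Strip only real strings so missing values pass through unchanged.
        cleaned[name] = cleaned[name].map(
            lambda value: value.strip() if isinstance(value, str) else value
        )
    return cleaned


# Hypothetical stand-in for the loaded demo data.
data = pd.DataFrame({
    "age": [39, 50],
    "native-country": [" United-States", " Cuba"],
})

cleaned = strip_string_values(data)
print(cleaned["native-country"].tolist())  # ['United-States', 'Cuba']
```

The cleaned frame would then be passed to `fit` in place of the raw one, so that conditions can be written without the leading space.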

@npatki added the under discussion label and removed the pending review label on Jul 19, 2022
@AndresAlgaba
Author

Hi @npatki, nice to meet you too! It's my pleasure; thanks to the team for the effort on SDV and the quick response.

Yes, I agree. Is there anything I can help with? I have already opened a PR.

@npatki
Contributor

npatki commented Jul 21, 2022

Unfortunately we don't have public write access to the S3 bucket, which is needed to make this change. We'll add this to our backlog and update the bug when we have a fix.

Thanks for your offer to help!

@AndresAlgaba
Author

Okay, thanks, and no problem!

@npatki removed the under discussion label on Aug 8, 2022