Redundant whitespace in the demo data #233
Comments
Hi @AndresAlgaba, nice to meet you and thanks for bringing this to our attention. The root cause was probably the way the original data was exported. While your suggestion would solve the issue for this particular demo, we'd prefer to fix the format of the underlying data itself. We can apply the same principle to any future demo datasets that deviate from a true csv format (in different ways). I suggest we repurpose this bug for reformatting the original demo data as a proper csv file. For now, we can suggest that everyone use your manual workaround of reading the csv with
Hi @npatki, nice to meet you too! It's my pleasure; thanks to the team for the effort on SDV and the quick response. Yes, I agree. Is there anything I can help with? I have already opened a PR.
Unfortunately we don't have public write access to the S3 bucket, which is needed to make this change. We'll add this to our backlog and update the bug when we have a fix. Thanks for your offer to help!
Okay, thanks, and no problem!
Hi everyone! First of all, thanks for all the work on this fantastic library and the Synthetic Data Vault in general :). I believe I found a minor bug in loading the demo data set and propose a quick fix for which I will submit a PR.
Environment Details
Error Description
When running the usage example for the CTGANSynthesizer with conditional sampling via the condition_column and condition_value arguments in the sample method, I get the following error:
rdt\transformers\categorical.py:374: UserWarning: The data contains 1 new categories that were not seen in the original data (examples: {'United-States'}). Creating a vector of all 0s. If you want to model new categories, please fit the transformer again with the new data.
After looking into it, I found out that the discrete variables contain redundant whitespace in front of the categories. Using ' United-States' (with the redundant whitespace) works fine.
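To illustrate the mismatch, here is a minimal sketch using Python's standard csv module on a made-up sample (the rows and column names are assumptions, not the actual demo file). When values are separated by a comma followed by a space, a default parse keeps the space as part of each value, so the category seen at fit time is ' United-States' rather than 'United-States':

```python
import csv
import io

# Hypothetical sample written with ", " between values, similar in
# spirit to the whitespace described above (not the real demo data).
raw = (
    "age,workclass,native-country\n"
    "39, State-gov, United-States\n"
    "50, Private, Cuba\n"
)

reader = csv.reader(io.StringIO(raw))
header = next(reader)
country_idx = header.index("native-country")

# Collect the categories exactly as a default parse returns them.
seen_categories = {row[country_idx] for row in reader}

# The value without the leading space was never seen; the padded one was.
has_plain = "United-States" in seen_categories
has_padded = " United-States" in seen_categories
print(has_plain, has_padded)
```

A condition value of 'United-States' therefore looks like a new category to anything fitted on the padded strings.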
Solution
I propose setting the skipinitialspace argument of pd.read_csv to True in the load_demo function. This seems to solve the issue.
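The effect of skipinitialspace can be checked on a small in-memory sample. This is only a sketch under assumptions: the rows and column names are made up, and it does not reproduce how load_demo actually reads the file.

```python
import io

import pandas as pd

# Hypothetical sample with a space after each delimiter (not the real demo data).
raw = (
    "age,workclass,native-country\n"
    "39, State-gov, United-States\n"
    "50, Private, Cuba\n"
)

# Default behaviour: the space after the comma stays in the string values.
default = pd.read_csv(io.StringIO(raw))

# With skipinitialspace=True, whitespace right after the delimiter is dropped.
fixed = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

default_values = default["native-country"].tolist()
fixed_values = fixed["native-country"].tolist()
print(default_values)
print(fixed_values)
```

Until the underlying file is reformatted, the same argument can be passed when reading a local copy of the csv manually.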
Steps to reproduce