
PARSynthesizer trying to allocate an absurd amount of memory for a small dataset #2012

Open · JonathanBhimani-Burrows opened this issue May 16, 2024 · 3 comments
Labels: bug, data:sequential, under discussion

@JonathanBhimani-Burrows

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.12.1
  • Python version: 3.10
  • Operating System: Google Colab

Error Description

I'm trying to use the PARSynthesizer to create synthetic data to improve model performance. However, it seems to have some serious issues when fitting the data. I might be missing something, but here is the setup:

{
    "columns": {
        "User_Lookup_Id": {
            "sdtype": "id"
        },
        "Revenue_Date": {
            "sdtype": "datetime",
            "datetime_format": "%Y-%m-%d"
        },
        "Revenue_Amount": {
            "sdtype": "numerical"
        },
        "User_First_Name": {
            "pii": false,
            "sdtype": "first_name"
        },
        "Gender": {
            "sdtype": "categorical"
        },
        "Address_City": {
            "sdtype": "categorical"
        },
        "Primary_Address": {
            "sdtype": "categorical"
        },
        "Average_Income": {
            "sdtype": "numerical"
        },
        "Social_Group_Name": {
            "sdtype": "categorical"
        },
        "Spouse": {
            "sdtype": "categorical"
        },
        "Active_Email": {
            "pii": false,
            "sdtype": "email"
        },
        "dummy_income": {
            "sdtype": "numerical",
            "computer_representation": "Float"
        }
    }
}

The context columns are ['User_First_Name', 'Address_City', 'Gender', 'Primary_Address', 'Social_Group_Name', 'Spouse_Is_Active', 'Active_Email', 'dummy_income']

synthesizer = PARSynthesizer(metadata, verbose=True, context_columns=context_cols, enforce_min_max_values=True)

When I try to fit the model, it tries to allocate 451 GB of GPU memory for 168k rows, which is absurd.
Setting segment_size seems to alleviate this (sketched below), but it limits the model to only producing segments of length segment_size, which is problematic if you have sequences longer than the segment size.
(I can't upload the data for confidentiality reasons, unfortunately.)
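
For reference, a minimal sketch of the segment_size workaround mentioned above, assuming the same metadata and context_cols as in the setup; segment_size is a documented PARSynthesizer parameter, but the value 50 and the name data are purely illustrative:

# Hedged sketch: cap sequence length during training via segment_size.
from sdv.sequential import PARSynthesizer

synthesizer = PARSynthesizer(
    metadata,
    context_columns=context_cols,
    enforce_min_max_values=True,
    segment_size=50,  # split long sequences into segments of 50 rows (illustrative value)
    verbose=True,
)
synthesizer.fit(data)  # data: the 168k-row training DataFrame (hypothetical name)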

Is there something I'm missing? Is this expected behavior? Because if so, some optimization is necessary, as 168k rows is not very much data to train the model on (the full dataset is 28 million rows). Is there not a batch size parameter that could be configured for this model?

Thanks for your help

@npatki (Contributor) commented May 20, 2024

Hi @JonathanBhimani-Burrows, thanks for reaching out and for sharing your metadata. One thing I notice is that some of the columns are marked as categorical or have pii set to False. To run SDV well, I think the metadata should be updated.

Brief description of how SDV works:

  • There are some attributes (columns) that SDV will use to learn important patterns, correlations, etc. Usually these are attributes that are statistical in nature, such as dollar amounts, dates, discrete categories, etc.
  • There are other attributes (e.g. names, addresses, etc.) that probably don't make sense for learning patterns, as they are usually private (PII) and don't contain statistical information. SDV will anonymize such attributes if you mark them with the correct sdtype and pii set to True.

Changes I would make to your metadata: the following columns should probably be anonymized (see the sketch after this list).

  1. First name and email should have pii set to True, so that SDV anonymizes them instead of trying to learn patterns within them.
  2. Primary address should have sdtype 'address' or 'street_address', with pii also set to True.
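
A minimal sketch of those changes, assuming metadata is a SingleTableMetadata object and using the column names from the metadata posted above; update_column is part of SDV's public metadata API, though the choice of 'street_address' here is just one of the two suggested options:

# Hedged sketch: mark PII columns so SDV anonymizes them rather than modeling them.
metadata.update_column(column_name='User_First_Name', sdtype='first_name', pii=True)
metadata.update_column(column_name='Active_Email', sdtype='email', pii=True)
metadata.update_column(column_name='Primary_Address', sdtype='street_address', pii=True)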


@JonathanBhimani-Burrows (Author)

Thanks for the reply, but this didn't really answer the question.
Setting pii = False for both first name and primary address was a deliberate design decision: I want both of those columns to be used to help determine the output.
Having said that, back to the original discussion: is this expected behavior? Does the model genuinely use an enormous amount of VRAM to instantiate? Is there no option for batch sizes?

@npatki (Contributor) commented May 28, 2024

Hi @JonathanBhimani-Burrows you are welcome.

> is this expected behavior?

I cannot give you an answer without learning more. Metadata is intrinsically related to performance, and I've seen multiple cases where a PII/categorical mix-up has led to issues. I appreciate you sharing that these columns are meant to be categorical; that is something I'm curious to know more about.

I suspect that the columns you've marked as categorical may have high cardinality, which is known to cause issues (expected). If you are willing to entertain an experiment, updating them to PII as I previously mentioned will help verify (or rule out) this guess, regardless of what your intended usage may be.

> Does the model genuinely use up an enormous amount of VRAM to instantiate?

My experience is that PAR is usable with a local machine’s RAM in many cases, especially when it's set up with PII columns and a sequence_index.
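
As an illustration, a minimal sketch of that setup, again assuming a SingleTableMetadata object and the columns from the metadata posted above; set_sequence_key and set_sequence_index are SDV metadata methods for sequential data:

# Hedged sketch: declare the per-sequence key and the ordering column so
# PAR knows how rows group into sequences and how each sequence is ordered.
metadata.set_sequence_key(column_name='User_Lookup_Id')
metadata.set_sequence_index(column_name='Revenue_Date')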

But we are aware of some performance issues popping up over time in #1965, and we are working to uncover root cause(s). I see you’ve also replied there.

> Is there no option for batch sizes?

All available parameters are documented on our website. We keep our docs up-to-date, so if you are already referencing those, then you are in the right place!

Just based on the algorithmic definition of PAR, batching within a sequence is not trivial. For more info, see the preprint.
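
To make that concrete, a rough back-of-envelope sketch of why training an autoregressive model over long, unsegmented sequences can explode in memory; every number below is hypothetical and none of them comes from PAR's actual architecture:

# Hypothetical illustration only: an autoregressive RNN must keep activations
# for every timestep to backpropagate, so memory grows with
# batch_size * seq_len * hidden_dim.
batch_size = 256       # hypothetical
seq_len = 10_000       # one very long sequence, not yet segmented
hidden_dim = 512       # hypothetical
bytes_per_float = 4

activations = batch_size * seq_len * hidden_dim * bytes_per_float
print(f"~{activations / 1e9:.1f} GB for a single layer's activations")
# Segmenting (segment_size) caps seq_len, which is why it alleviates the
# allocation, at the cost of truncating sequences longer than the segment.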
