
PARSynthesizer trying to allocate an absurd amount of memory for a small dataset #2012

Open · JonathanBhimani-Burrows opened this issue May 16, 2024 · 3 comments
Labels: bug, data:sequential, under discussion

@JonathanBhimani-Burrows

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.12.1
  • Python version: 3.10
  • Operating System: Google Colab

Error Description

I'm trying to use the PARSynthesizer to create synthetic data to improve model performance. However, it seems to have some serious issues when fitting the data. I might be missing something, but here is the setup:

{
    "columns": {
        "User_Lookup_Id": {
            "sdtype": "id"
        },
        "Revenue_Date": {
            "sdtype": "datetime",
            "datetime_format": "%Y-%m-%d"
        },
        "Revenue_Amount": {
            "sdtype": "numerical"
        },
        "User_First_Name": {
            "pii": false,
            "sdtype": "first_name"
        },
        "Gender": {
            "sdtype": "categorical"
        },
        "Address_City": {
            "sdtype": "categorical"
        },
        "Primary_Address": {
            "sdtype": "categorical"
        },
        "Average_Income": {
            "sdtype": "numerical"
        },
        "Social_Group_Name": {
            "sdtype": "categorical"
        },
        "Spouse": {
            "sdtype": "categorical"
        },
        "Active_Email": {
            "pii": false,
            "sdtype": "email"
        },
        "dummy_income": {
            "sdtype": "numerical",
            "computer_representation": "Float"
        }
    }
}

The context columns are ['User_First_Name', 'Address_City', 'Gender', 'Primary_Address', 'Social_Group_Name', 'Spouse_Is_Active', 'Active_Email', 'dummy_income']

synthesizer = PARSynthesizer(metadata, verbose=True, context_columns=context_cols, enforce_min_max_values=True)

When I try to fit the model, it tries to allocate 451 GB of GPU memory for 168k rows, which is absurd.
Setting segment_size seems to alleviate this (sketched below), but it limits the model to only producing segments of length segment_size, which is problematic if you have sequences longer than the segment size.
(I can't upload the data for confidentiality reasons, unfortunately.)
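
For reference, a minimal sketch of the segment_size workaround mentioned above, assuming the same metadata and context_cols as in the setup; segment_size is a documented PARSynthesizer parameter, but the value 50 and the name data are purely illustrative:

# Hedged sketch: cap sequence length during training via segment_size.
from sdv.sequential import PARSynthesizer

synthesizer = PARSynthesizer(
    metadata,
    context_columns=context_cols,
    enforce_min_max_values=True,
    segment_size=50,  # split long sequences into segments of 50 rows (illustrative value)
    verbose=True,
)
synthesizer.fit(data)  # data: the 168k-row training DataFrame (hypothetical name)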

Is there something I'm missing? Is this expected behavior? Because if so, some optimization is necessary, as 168k rows is not very much data to train the model on (the full dataset is 28 million rows). Is there not a batch size parameter that could be configured for this model?

Thanks for your help

@npatki (Contributor) commented May 20, 2024

Hi @JonathanBhimani-Burrows, thanks for reaching out and for sharing your metadata. One thing I notice is that some of the columns are marked as categorical or have pii set to False. To run SDV well, I think the metadata should be updated.

Brief description of how SDV works:

  • There are some attributes (columns) that SDV will use to learn important patterns, correlations, etc. Usually these are attributes that are statistical in nature, such as dollar amounts, dates, discrete categories, etc.
  • There are other attributes (e.g. names, addresses, etc.) that probably don't make sense for learning patterns, as they are usually private (PII) and don't contain statistical information. SDV will anonymize such attributes if you mark them with the correct sdtype and pii set to True.

Changes I would make to your metadata: the following columns should probably be anonymized (see the sketch after this list).

  1. First name and email should have pii set to True, so that SDV anonymizes them instead of trying to learn patterns within them.
  2. Primary address should have sdtype 'address' or 'street_address', with pii also set to True.
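
A minimal sketch of those changes, assuming metadata is a SingleTableMetadata object and using the column names from the metadata posted above; update_column is part of SDV's public metadata API, though the choice of 'street_address' here is just one of the two suggested options:

# Hedged sketch: mark PII columns so SDV anonymizes them rather than modeling them.
metadata.update_column(column_name='User_First_Name', sdtype='first_name', pii=True)
metadata.update_column(column_name='Active_Email', sdtype='email', pii=True)
metadata.update_column(column_name='Primary_Address', sdtype='street_address', pii=True)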


@JonathanBhimani-Burrows (Author)

Thanks for the reply, but this didn't really answer the question.
Setting pii = False for both first name and primary address was a deliberate design decision: I want both of those columns to be used to help determine the output.
Having said that, back to the original discussion: is this expected behavior? Does the model genuinely use an enormous amount of VRAM to instantiate? Is there no option for batch sizes?

@npatki (Contributor) commented May 28, 2024

Hi @JonathanBhimani-Burrows you are welcome.

> is this expected behavior?

I cannot give you an answer without learning more. Metadata is intrinsically related to performance, and I've seen multiple cases where a PII/categorical mix-up has led to issues. I appreciate you sharing that these columns are meant to be categorical; that is something I'm curious to know more about.

I suspect that the columns you've marked as categorical may have high cardinality, which is known to cause issues (expected). If you are willing to entertain an experiment, updating them to PII as I previously mentioned will help verify (or rule out) this guess, regardless of what your intended usage may be.

> Does the model genuinely use up an enormous amount of VRAM to instantiate?

My experience is that PAR is usable with a local machine’s RAM in many cases, especially when it's set up with PII columns and a sequence_index.
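
As an illustration, a minimal sketch of that setup, again assuming a SingleTableMetadata object and the columns from the metadata posted above; set_sequence_key and set_sequence_index are SDV metadata methods for sequential data:

# Hedged sketch: declare the per-sequence key and the ordering column so
# PAR knows how rows group into sequences and how each sequence is ordered.
metadata.set_sequence_key(column_name='User_Lookup_Id')
metadata.set_sequence_index(column_name='Revenue_Date')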

But we are aware of some performance issues popping up over time in #1965, and we are working to uncover root cause(s). I see you’ve also replied there.

> Is there no option for batch sizes?

All available parameters are documented on our website. We keep our docs up-to-date, so if you are already referencing those, then you are in the right place!

Just based on the algorithmic definition of PAR, batching within a sequence is not trivial. For more info, see the preprint.
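
To make that concrete, a rough back-of-envelope sketch of why training an autoregressive model over long, unsegmented sequences can explode in memory; every number below is hypothetical and none of them comes from PAR's actual architecture:

# Hypothetical illustration only: an autoregressive RNN must keep activations
# for every timestep to backpropagate, so memory grows with
# batch_size * seq_len * hidden_dim.
batch_size = 256       # hypothetical
seq_len = 10_000       # one very long sequence, not yet segmented
hidden_dim = 512       # hypothetical
bytes_per_float = 4

activations = batch_size * seq_len * hidden_dim * bytes_per_float
print(f"~{activations / 1e9:.1f} GB for a single layer's activations")
# Segmenting (segment_size) caps seq_len, which is why it alleviates the
# allocation, at the cost of truncating sequences longer than the segment.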
