PARSynthesizer trying to allocate an absurd amount of memory for a small dataset #2012
Comments
Hi @JonathanBhimani-Burrows, thanks for reaching out and for sharing your metadata. One thing I notice from your metadata is that some of the columns are marked as categorical.
Brief description of how SDV works:
Changes I would make to your metadata: the following columns should probably be anonymized --
Resources:
Thanks for the reply, but this didn't really answer the question.
Hi @JonathanBhimani-Burrows, you're welcome.
I cannot give you an answer without learning more. Metadata is intrinsically related to performance, and I've seen multiple cases where a PII/categorical mixup has led to issues. I appreciate you sharing that these columns are meant to be categorical, and I'm curious to learn more about them. I suspect that the columns you've marked as categorical may be high cardinality, which is known (and expected) to cause issues. If you are willing to entertain an experiment, updating them to PII like I previously mentioned will help verify (or rule out) this guess, regardless of your intended usage.
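The high-cardinality guess above can be checked locally before fitting. Here is a minimal stdlib-only sketch (not SDV code); the 0.5 threshold is an arbitrary choice, and the sample values are placeholders, not the reporter's data:

```python
from collections import Counter

def looks_like_pii(values, unique_ratio_threshold=0.5):
    """Heuristic: a 'categorical' column where most values are distinct
    is more likely an identifier/PII column than a true category."""
    counts = Counter(values)
    unique_ratio = len(counts) / len(values)
    return unique_ratio >= unique_ratio_threshold

# Placeholder examples -- not the reporter's dataset.
print(looks_like_pii(["M", "F", "M", "F", "M"]))  # low cardinality -> False
print(looks_like_pii(["12 Oak St", "9 Elm Rd", "3 Pine Ave", "7 Birch Ln"]))  # all distinct -> True
```

Columns that come back flagged are candidates for switching to a PII sdtype in the metadata.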
My experience is that PAR is usable within a local machine's RAM in many cases, especially when it's set up with PII columns. But we are aware of some performance issues popping up over time in #1965, and we are working to uncover the root cause(s). I see you've also replied there.
All available parameters are documented on our website. We keep our docs up to date, so if you are already referencing those, you are in the right place! Just based on the algorithm definition of PAR, batching within a sequence is not trivial. For more info, see the Preprint.
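To see why high cardinality alone can explain an allocation of this size, here is a purely illustrative back-of-envelope calculation (the one-hot encoding assumption and the cardinality figure are hypothetical, not SDV's actual internals):

```python
# Hypothetical illustration: if a "categorical" column is one-hot encoded,
# the encoded width grows with its cardinality.
rows = 168_000            # row count from the report
cardinality = 100_000     # assumed distinct values in one high-cardinality column
bytes_per_float = 4       # float32

one_hot_bytes = rows * cardinality * bytes_per_float
print(f"{one_hot_bytes / 1e9:.0f} GB")  # prints "67 GB" for this single column
```

A few such columns, plus model activations, could plausibly reach the hundreds of gigabytes reported; marking them as PII instead sidesteps the encoding entirely.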
Environment Details
Please indicate the following details about the environment in which you found the bug:
Error Description
I'm trying to use the PARSynthesizer to create synthetic data to improve model performance. However, it seems to have some serious issues when fitting the data. I might be missing something, but here is the setup:
{
"columns": {
"User_Lookup_Id": {
"sdtype": "id"
},
"Revenue_Date": {
"sdtype": "datetime",
"datetime_format": "%Y-%m-%d"
},
"Revenue_Amount": {
"sdtype": "numerical"
},
"User_First_Name": {
"pii": false,
"sdtype": "first_name"
},
"Gender": {
"sdtype": "categorical"
},
"Address_City": {
"sdtype": "categorical"
},
"Primary_Address": {
"sdtype": "categorical"
},
"Average_Income": {
"sdtype": "numerical"
},
"Social_Group_Name": {
"sdtype": "categorical"
},
"Spouse": {
"sdtype": "categorical"
},
"Active_Email": {
"pii": false,
"sdtype": "email"
},
"dummy_income": {
"sdtype": "numerical",
"computer_representation": "Float"
}
  }
}
Context columns are ['User_First_Name', 'Address_City', 'Gender', 'Primary_Address', 'Social_Group_Name', 'Spouse_Is_Active', 'Active_Email', 'dummy_income']
synthesizer = PARSynthesizer(metadata, verbose=True, context_columns=context_cols, enforce_min_max_values=True)
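One stdlib-only sanity check worth running on this setup (not SDV API, just plain Python): verify every context column actually appears in the metadata. Note that 'Spouse_Is_Active' is in the context list above but the metadata defines a column named 'Spouse':

```python
# Column names copied from the metadata JSON above.
metadata_columns = {
    "User_Lookup_Id", "Revenue_Date", "Revenue_Amount", "User_First_Name",
    "Gender", "Address_City", "Primary_Address", "Average_Income",
    "Social_Group_Name", "Spouse", "Active_Email", "dummy_income",
}
context_cols = ['User_First_Name', 'Address_City', 'Gender', 'Primary_Address',
                'Social_Group_Name', 'Spouse_Is_Active', 'Active_Email', 'dummy_income']

# Any context column missing from the metadata is a setup error.
missing = [col for col in context_cols if col not in metadata_columns]
print(missing)  # ['Spouse_Is_Active']
```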
When I try to run the model, it tries to allocate 451 GB of GPU memory for 168k rows, which is absurd.
Setting segment_size seems to alleviate this, but it limits the model to only producing segments of length segment_size, which is problematic if you have sequences longer than segment_size.
(I can't upload the data for confidentiality reasons unfortunately)
Is there something I'm missing? Is this expected behavior? Because if so, some optimization is necessary, as 168k rows is not very much data to train the model on (the full dataset is 28 million rows). Is there not a batch size parameter that could be configured for this model?
Thanks for your help
Steps to reproduce