Optimize PARSynthesizer's performance #1965

Open
srinify opened this issue Apr 26, 2024 · 3 comments
Labels: data:sequential (related to timeseries datasets), feature request (request for a new feature)

Comments


srinify commented Apr 26, 2024

Problem Description

A number of SDV users have run into performance issues when using PARSynthesizer with their data. These issues usually manifest as CPU out-of-memory errors or CUDA out-of-memory errors; other times, the model simply takes a long time to train.

I'm creating this thread to collect all of these examples from the community so the SDV core team has the context they need to understand and improve the performance of PARSynthesizer.

For anyone using SDV PARSynthesizer, please add new examples of performance issues to this thread!

@srinify added the data:sequential and feature request labels and removed the new label on Apr 26, 2024

srinify commented Apr 26, 2024

Reported Example 1

CPU out-of-memory error

#1952 by @prupireddy

RuntimeError: [enforce fail at alloc_cpu.cpp:114] data. DefaultCPUAllocator: not enough memory: you tried to allocate 683656 bytes. 

"I find this particularly surprising given that I am running this on a machine with 128 GM RAM and I just restarted it."

Suggested Workaround

My recommendation would be to sample the data to reduce the footprint. You can either use fewer rows per sequence or fewer sequences overall. Start with a much smaller sample than you think you need (perhaps a 5% sample of your data) and then increase it by 5% at a time until the generated data is good enough.
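
As a rough illustration of this workaround, here is a minimal sketch that keeps only a fraction of the sequences before fitting. It assumes a pandas DataFrame `df` with a sequence key column named `id`; the column name, the helper function, and the 5% starting fraction are placeholders for illustration, not part of the original report.

import numpy as np
import pandas as pd

def sample_sequences(df: pd.DataFrame, key: str = "id",
                     frac: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Keep a random fraction of the sequences (all rows of each kept sequence)."""
    rng = np.random.default_rng(seed)
    ids = df[key].unique()
    kept = rng.choice(ids, size=max(1, int(len(ids) * frac)), replace=False)
    return df[df[key].isin(kept)]

# Start with ~5% of the sequences and increase gradually if the generated data
# is not good enough yet.
df_small = sample_sequences(df, key="id", frac=0.05)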


srinify commented Apr 26, 2024

Reported Example 2

CUDA out-of-memory error

https://sdv-space.slack.com/archives/C01GSDFSQ93/p1713451980542979 by Isaac (Slack)

Use Case: PAR for forecasting time series
Scale of data:

  • 50k sequences
  • 45 rows per sequence
  • Total: ~2.25M rows

Attempted Workarounds:

  • Setting a lower segment_size resulted in a new PyTorch error:
    • If I try a sequence length of 8, I get:
r.nvmlDeviceGetNvLinkRemoteDeviceType_ INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1712608853099/work/c10/cuda/driver_api.cpp":27, please report a bug to PyTorch. Can't find nvmlDeviceGetNvLinkRemoteDeviceType: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType

Example Code (Srini):

import numpy as np
import pandas as pd

from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

# ID column: 50,000 sequences, each with 45 rows
ids = np.arange(0, 50_000, 1)
ids = np.repeat(ids, 45)

# Sequence index column: ticks 0..44 within each sequence
ticks = np.arange(0, 45, 1)
ticks = np.tile(ticks, 50_000)

# Observations column: one random value per row
obs = np.random.normal(loc=5, scale=1, size=len(ids))

df = pd.DataFrame(
    {
        "id": ids,
        "ticks": ticks,
        "obs": obs,
    }
)

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)
metadata.update_column(column_name='id', sdtype='id')
metadata.set_sequence_key(column_name='id')
metadata.set_sequence_index(column_name='ticks')

synthesizer = PARSynthesizer(metadata, verbose=True)
synthesizer.fit(df)
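
For reference, the segment_size workaround attempted above would look roughly like the following sketch. This is not a confirmed fix; in the reporter's environment it surfaced the NVML/PyTorch error quoted earlier, and the value of 8 is simply the sequence length they said they tried.

# Sketch of the attempted segment_size workaround (not a confirmed fix).
# segment_size cuts each training sequence into shorter segments, which can
# reduce the memory needed per training batch.
synthesizer = PARSynthesizer(
    metadata,
    verbose=True,
    segment_size=8,  # the segment length the reporter said they tried
)
synthesizer.fit(df)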


liuup commented Apr 27, 2024


I recently ran into the problem described in Example 1. What solved it for me was changing segment_size from the default to 5, 10, or larger, which reduces the computation time. I don't know if this will help you, but it works on my machine. My PARSynthesizer definition looks something like this:

"""     Step1:    Create the synthesizer    """
synthesizer = PARSynthesizer(
    metadata,
    cuda =  True,
    verbose = True,
    epochs = 512,
    segment_size = 5,
    sample_size = 20,
)

The explanation of segment_size is here: https://docs.sdv.dev/sdv/sequential-data/modeling/parsynthesizer#:~:text=segment_size,into%20any%20segments.
Hope this helps.
