Loss values are good, but the quality of the synthetic data is bad... How?? #2010

Closed
ilkayyuksel opened this issue May 16, 2024 · 2 comments
Labels: question (General question about the software), resolution:cannot replicate (The problem cannot be replicated)

Comments

@ilkayyuksel

I am using the CTGAN model for my master's thesis. I want to generate synthetic data from the UNSW_NB15 dataset (an intrusion detection system dataset, so it contains attacks). Specifically, I want to generate synthetic data for the 'Generic' attack category, which has 58,871 real samples to train with.

I have trained my CTGAN model with the following code:

from ctgan import CTGAN


# `real_data` is a pandas DataFrame with the 'Generic' attack rows and
# `discrete_columns` lists the categorical column names (both defined earlier in my script)
ctgan = CTGAN(epochs=600, verbose=True, generator_lr=1e-5, discriminator_lr=1e-6,
              batch_size=128, pac=2, generator_decay=1e-6, discriminator_decay=1e-6,
              discriminator_steps=1)
ctgan.fit(real_data, discrete_columns)
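
(For reference, the synthetic data evaluated below was drawn from the fitted model along these lines; the sample size shown is just the size of the real subset and is illustrative.)

# Sketch: draw synthetic rows from the fitted CTGAN model
synthetic_data = ctgan.sample(58871)   # sample size is illustrative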

Loss values:

[Image: plot of the generator and discriminator loss values over training]

Those are the loss values for my generator and discriminator. If you look at discussion #980, you would expect the CTGAN model to generate really good synthetic data.

But when I use the metrics from SDV to compare the real data with the synthetic data, the scores are bad:

KS_complement:

Column: dur , Score:  0.47134738665896614
Column: spkts , Score:  0.6197188938526609
Column: dpkts , Score:  0.723784647789234
Column: sbytes , Score:  0.30066847853781997
Column: dbytes , Score:  0.39178464778923405
Column: rate , Score:  0.5549714460430433
Column: sload , Score:  0.6335265580676395
Column: dload , Score:  0.3017846477892341
Column: sloss , Score:  0.777054237230555
Column: dloss , Score:  0.7870712235226173
Column: sinpkt , Score:  0.47327176368670476
Column: dinpkt , Score:  0.36778464778923403
Column: sjit , Score:  0.30070190756059856
Column: djit , Score:  0.4202410864432403
Column: swin , Score:  0.7760882098146795
Column: stcpb , Score:  0.3720882098146795
Column: dtcpb , Score:  0.44599999999999995
Column: dwin , Score:  0.7760882098146795
Column: tcprtt , Score:  0.37802026464643035
Column: synack , Score:  0.483
Column: ackdat , Score:  0.485
Column: smean , Score:  0.3137999864109664
Column: dmean , Score:  0.713784647789234
Column: trans_depth , Score:  0.8535419136756637
Column: response_body_len , Score:  0.4398115031169847
Column: ct_src_dport_ltm , Score:  0.36141310662295534
Column: ct_dst_sport_ltm , Score:  0.3924831920640043
Column: ct_flw_http_mthd , Score:  0.8455249273836014

Average:  0.5271555622826664

TV_complement:

Column: proto , Score:  0.42350683698255553
Column: service , Score:  0.28905916325525305
Column: state , Score:  0.6670394421701688

Average:  0.45986848080265913
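
(For context, the per-column scores above can be reproduced with sdmetrics along these lines; the column lists below are illustrative subsets and the exact loop in my script may differ.)

# Sketch: per-column fidelity scores with sdmetrics
# `real_data` and `synthetic_data` are pandas DataFrames with the same columns
from sdmetrics.single_column import KSComplement, TVComplement

numerical_columns = ['dur', 'spkts', 'dpkts']        # illustrative subset
categorical_columns = ['proto', 'service', 'state']

for col in numerical_columns:
    score = KSComplement.compute(real_data=real_data[col], synthetic_data=synthetic_data[col])
    print('Column:', col, ', Score: ', score)

for col in categorical_columns:
    score = TVComplement.compute(real_data=real_data[col], synthetic_data=synthetic_data[col])
    print('Column:', col, ', Score: ', score)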

When I plot the distributions of each feature, they also look bad.

Can you help me? What did I do wrong? Why do the fake samples have such bad quality?

P.S. If I use SMOTE, the SDV metric scores are better, but I have to use a GAN model...

ilkayyuksel added the new (Automatic label applied to new issues) and question (General question about the software) labels on May 16, 2024
srinify added the under discussion (Issue is currently being discussed) label and removed the new label on May 22, 2024
@srinify (Contributor) commented May 22, 2024

Hi there @ilkayyuksel 👋

Do you mind sharing some visualizations of what your marginal distributions look like? This would help us understand if they're bimodal, skewed, etc.
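
For example, something along these lines would work for a single column (a plain matplotlib sketch, not a specific SDV API; the DataFrame and column names are placeholders):

# Sketch: overlay the real and synthetic marginal distributions for one column
import matplotlib.pyplot as plt

column = 'dur'   # illustrative column
plt.hist(real_data[column], bins=50, alpha=0.5, density=True, label='real')
plt.hist(synthetic_data[column], bins=50, alpha=0.5, density=True, label='synthetic')
plt.legend()
plt.title(f'Marginal distribution of {column}')
plt.show()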

In general, the loss chart looks good, and that can correlate with high-quality synthetic data, but that's not always the case with CTGAN. GANs in general can be cumbersome to tune (which is often why we point people to Gaussian Copulas instead!), but it seems like a GAN is the approach you'll need to take.

Some potential avenues to consider:

  • Pre-process the data more thoroughly to make it easier for CTGAN to capture the patterns. If you're able to use SDV instead of CTGAN directly, we do some pre-processing for you based on the metadata, and we make it easy for you to tweak the data transformations. If that's interesting to you, check out the CTGANSynthesizer from SDV (see the sketch after this list).

  • Tune the hyperparameters using an external library. BTB is one that comes to mind (but we aren't experts in this ourselves, so we can't provide specific support). You can read our FAQ article here on tuning hyperparameters.
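
For the first suggestion, here is a minimal sketch of what the SDV route could look like (the metadata-detection step and the sample size are assumptions; the hyperparameters are copied from your snippet, though the defaults are a reasonable starting point too):

# Sketch: SDV's CTGANSynthesizer wraps CTGAN and applies its own pre-processing
# based on the detected metadata. `real_data` is your pandas DataFrame.
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real_data)
# Review and correct the detected sdtypes before fitting, e.g.:
# metadata.update_column(column_name='proto', sdtype='categorical')

synthesizer = CTGANSynthesizer(metadata, epochs=600, verbose=True, batch_size=128, pac=2)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=58871)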

@srinify (Contributor) commented Jun 3, 2024

Hi there @ilkayyuksel, I'm closing this issue out for now since I haven't heard from you in a while. But comment here and we can re-open it if you still need guidance!

I'd also encourage you to join our Slack community if you aren't there already :)

srinify closed this as completed on Jun 3, 2024
srinify added the resolution:cannot replicate (The problem cannot be replicated) label and removed the under discussion label on Jun 3, 2024