Loss values are good, but the quality of the synthetic data is bad... How?? #2010

Closed
ilkayyuksel opened this issue May 16, 2024 · 2 comments
Labels: question (General question about the software), resolution:cannot replicate (The problem cannot be replicated)

Comments

@ilkayyuksel

I am using the CTGAN model for my master's thesis. I want to generate synthetic data from the UNSW_NB15 dataset (an intrusion detection system dataset, so it contains attacks). Specifically, I want to generate synthetic data for the 'Generic' attack category, which has 58,871 real samples to train with.

I have trained my CTGAN model with the following code:

from ctgan import CTGAN


# `real_data` is a pandas DataFrame with the 'Generic' attack rows and
# `discrete_columns` lists the categorical column names (both defined earlier in my script)
ctgan = CTGAN(epochs=600, verbose=True, generator_lr=1e-5, discriminator_lr=1e-6,
              batch_size=128, pac=2, generator_decay=1e-6, discriminator_decay=1e-6,
              discriminator_steps=1)
ctgan.fit(real_data, discrete_columns)
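
(For reference, the synthetic data evaluated below was drawn from the fitted model along these lines; the sample size shown is just the size of the real subset and is illustrative.)

# Sketch: draw synthetic rows from the fitted CTGAN model
synthetic_data = ctgan.sample(58871)   # sample size is illustrative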

Loss values:

[Image: plot of the generator and discriminator loss values over training]

Those are the loss values for my generator and discriminator. If you look at discussion #980, you would expect the CTGAN model to generate really good synthetic data.

But when I use the metrics from SDV to compare the real data with the synthetic data, the scores are bad:

KS_complement:

Column: dur , Score:  0.47134738665896614
Column: spkts , Score:  0.6197188938526609
Column: dpkts , Score:  0.723784647789234
Column: sbytes , Score:  0.30066847853781997
Column: dbytes , Score:  0.39178464778923405
Column: rate , Score:  0.5549714460430433
Column: sload , Score:  0.6335265580676395
Column: dload , Score:  0.3017846477892341
Column: sloss , Score:  0.777054237230555
Column: dloss , Score:  0.7870712235226173
Column: sinpkt , Score:  0.47327176368670476
Column: dinpkt , Score:  0.36778464778923403
Column: sjit , Score:  0.30070190756059856
Column: djit , Score:  0.4202410864432403
Column: swin , Score:  0.7760882098146795
Column: stcpb , Score:  0.3720882098146795
Column: dtcpb , Score:  0.44599999999999995
Column: dwin , Score:  0.7760882098146795
Column: tcprtt , Score:  0.37802026464643035
Column: synack , Score:  0.483
Column: ackdat , Score:  0.485
Column: smean , Score:  0.3137999864109664
Column: dmean , Score:  0.713784647789234
Column: trans_depth , Score:  0.8535419136756637
Column: response_body_len , Score:  0.4398115031169847
Column: ct_src_dport_ltm , Score:  0.36141310662295534
Column: ct_dst_sport_ltm , Score:  0.3924831920640043
Column: ct_flw_http_mthd , Score:  0.8455249273836014

Average:  0.5271555622826664

TV_complement:

Column: proto , Score:  0.42350683698255553
Column: service , Score:  0.28905916325525305
Column: state , Score:  0.6670394421701688

Average:  0.45986848080265913
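
(For context, the per-column scores above can be reproduced with sdmetrics along these lines; the column lists below are illustrative subsets and the exact loop in my script may differ.)

# Sketch: per-column fidelity scores with sdmetrics
# `real_data` and `synthetic_data` are pandas DataFrames with the same columns
from sdmetrics.single_column import KSComplement, TVComplement

numerical_columns = ['dur', 'spkts', 'dpkts']        # illustrative subset
categorical_columns = ['proto', 'service', 'state']

for col in numerical_columns:
    score = KSComplement.compute(real_data=real_data[col], synthetic_data=synthetic_data[col])
    print('Column:', col, ', Score: ', score)

for col in categorical_columns:
    score = TVComplement.compute(real_data=real_data[col], synthetic_data=synthetic_data[col])
    print('Column:', col, ', Score: ', score)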

When I plot the distributions of each feature, they also look bad.

Can you help me? What did I do wrong? Why do the fake samples have such bad quality?

P.S. If I use SMOTE, the SDV metric scores are better, but I have to use a GAN model...

ilkayyuksel added the new (Automatic label applied to new issues) and question (General question about the software) labels on May 16, 2024
srinify added the under discussion (Issue is currently being discussed) label and removed the new label on May 22, 2024
@srinify (Contributor) commented May 22, 2024

Hi there @ilkayyuksel 👋

Do you mind sharing some visualizations of what your marginal distributions look like? This would help us understand if they're bimodal, skewed, etc.
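
For example, something along these lines would work for a single column (a plain matplotlib sketch, not a specific SDV API; the DataFrame and column names are placeholders):

# Sketch: overlay the real and synthetic marginal distributions for one column
import matplotlib.pyplot as plt

column = 'dur'   # illustrative column
plt.hist(real_data[column], bins=50, alpha=0.5, density=True, label='real')
plt.hist(synthetic_data[column], bins=50, alpha=0.5, density=True, label='synthetic')
plt.legend()
plt.title(f'Marginal distribution of {column}')
plt.show()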

In general, the loss chart looks good, and that can correlate with high-quality synthetic data, but that's not always the case with CTGAN. GANs in general can be cumbersome to tune (which is often why we point people to Gaussian Copulas instead!), but it seems like a GAN is the approach you'll need to take.

Some potential avenues to consider:

  • Pre-process the data more thoroughly to make it easier for CTGAN to capture the patterns. If you're able to use SDV instead of CTGAN directly, we do some pre-processing for you based on the metadata, and we make it easy for you to tweak the data transformations. If that's interesting to you, check out the CTGANSynthesizer from SDV (see the sketch after this list).

  • Tune the hyperparameters using an external library. BTB is one that comes to mind (but we aren't experts in this ourselves, so we can't provide specific support). You can read our FAQ article here on tuning hyperparameters.
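
For the first suggestion, here is a minimal sketch of what the SDV route could look like (the metadata-detection step and the sample size are assumptions; the hyperparameters are copied from your snippet, though the defaults are a reasonable starting point too):

# Sketch: SDV's CTGANSynthesizer wraps CTGAN and applies its own pre-processing
# based on the detected metadata. `real_data` is your pandas DataFrame.
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real_data)
# Review and correct the detected sdtypes before fitting, e.g.:
# metadata.update_column(column_name='proto', sdtype='categorical')

synthesizer = CTGANSynthesizer(metadata, epochs=600, verbose=True, batch_size=128, pac=2)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=58871)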

@srinify (Contributor) commented Jun 3, 2024

Hi there @ilkayyuksel, I'm closing this issue out for now since I haven't heard from you in a while. But comment here and we can re-open it if you still need guidance!

I'd also encourage you to join our Slack community if you aren't there already :)

srinify closed this as completed on Jun 3, 2024
srinify added the resolution:cannot replicate (The problem cannot be replicated) label and removed the under discussion label on Jun 3, 2024