Class Imbalances #1967

bdytx5 · 2024-04-28T20:55:00Z

Does class imbalance need to be handled in the data before synthesizing new data with synthesizers like CTGAN? Also, is there a way to validate the model to ensure that it isn't overfitting?

srinify · 2024-05-09T18:48:47Z

Hi there @bdytx5 👋

Class imbalance

SDV's synthesizers were designed to model the patterns in your original data so that they can generate new, representative examples. The synthesizers will usually model the class imbalance (e.g. class A is 5% of the rows and class B is 95% of the rows) and generate synthetic examples with the same proportions.
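To check whether the proportions were actually preserved, you can compare the class shares of a label column in the real and synthetic tables. This is a minimal sketch using only the standard library; the toy label lists and the `class_proportions` helper are hypothetical stand-ins, not part of SDV.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of rows belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels standing in for the label column of the real and synthetic tables.
real_labels = ["A"] * 5 + ["B"] * 95
synthetic_labels = ["A"] * 12 + ["B"] * 188

real_share = class_proportions(real_labels)
synthetic_share = class_proportions(synthetic_labels)

# Print the real and synthetic share side by side for each class.
for label in sorted(real_share):
    print(label, round(real_share[label], 3), round(synthetic_share.get(label, 0.0), 3))
```

With pandas DataFrames, `df["label"].value_counts(normalize=True)` gives the same shares directly.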

Overfitting

I'd highly recommend trying out GaussianCopulaSynthesizer as an alternative to CTGANSynthesizer. It's very good at avoiding overfitting, is significantly faster to train, and often results in comparable model performance / quality.
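Swapping synthesizers is a small change. The sketch below assumes `real_data` is a pandas DataFrame and `metadata` is an SDV metadata object describing it; the `fit_and_sample` wrapper is a hypothetical helper added for illustration.

```python
def fit_and_sample(real_data, metadata, num_rows):
    """Fit a GaussianCopulaSynthesizer and return num_rows synthetic rows."""
    # Imported lazily so the helper can be defined without SDV installed.
    from sdv.single_table import GaussianCopulaSynthesizer

    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(real_data)
    return synthesizer.sample(num_rows=num_rows)
```

Replacing `GaussianCopulaSynthesizer` with `CTGANSynthesizer` from the same module should keep the rest of the code unchanged, which makes it easy to compare the two on your own data.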

Even if you decide to stick with CTGANSynthesizer, our data preprocessing step that happens in the background before the model is trained also helps the model avoid overfitting.

Every time you generate synthetic data, we recommend running through our evaluation tools to inspect and understand how the model performed by comparing the synthetic data with the real data.

DiagnosticReport helps validate that basic criteria were maintained
QualityReport checks for statistical similarity between your real & synthetic data
Our visualizations can help you visually understand the model quality as it pertains to different columns
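The three tools above can be run together roughly as follows. This is a sketch, assuming `real_data` and `synthetic_data` are pandas DataFrames with the same columns and `metadata` is the SDV metadata object used to train the synthesizer; the `evaluate_synthetic_data` wrapper and the `column_name` argument are hypothetical names chosen for illustration.

```python
def evaluate_synthetic_data(real_data, synthetic_data, metadata, column_name):
    """Run SDV's diagnostic, quality, and column plot on one synthetic table."""
    # Imported lazily so the helper can be defined without SDV installed.
    from sdv.evaluation.single_table import (
        evaluate_quality,
        get_column_plot,
        run_diagnostic,
    )

    # Basic validity checks on the synthetic data.
    diagnostic_report = run_diagnostic(real_data, synthetic_data, metadata)

    # Statistical similarity between real and synthetic data.
    quality_report = evaluate_quality(real_data, synthetic_data, metadata)

    # Real vs. synthetic distribution for a single column, e.g. the class label.
    figure = get_column_plot(
        real_data=real_data,
        synthetic_data=synthetic_data,
        metadata=metadata,
        column_name=column_name,
    )
    return diagnostic_report, quality_report, figure
```

Calling `quality_report.get_score()` returns the overall score, and `figure.show()` renders the plot. Passing the class label as `column_name` is one way to see visually whether the class proportions carried over.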

We created an open source library called SDMetrics if you want to go even deeper on synthetic data quality. SDV uses this library under the hood for the 3 features I mentioned above.

bdytx5 added the bug and new labels Apr 28, 2024

bdytx5 changed the title ~~CTGAN Synthesizer not using GPU~~ NA Apr 28, 2024

bdytx5 closed this as completed Apr 28, 2024

bdytx5 reopened this Apr 28, 2024

bdytx5 changed the title NA Class Imbalances Apr 28, 2024

srinify added the under discussion label and removed the bug and new labels May 9, 2024
