Class Imbalances #1967

Open
bdytx5 opened this issue Apr 28, 2024 · 1 comment
Labels
under discussion Issue is currently being discussed

Comments


bdytx5 commented Apr 28, 2024

Does class imbalance need to be handled in the data before synthesizing new data with synthesizers such as CTGAN? Also, is there a way to validate the model to ensure that it isn't overfitting?

@bdytx5 bdytx5 added bug Something isn't working new Automatic label applied to new issues labels Apr 28, 2024
@bdytx5 bdytx5 changed the title CTGAN Synthesizer not using GPU NA Apr 28, 2024
@bdytx5 bdytx5 closed this as completed Apr 28, 2024
@bdytx5 bdytx5 reopened this Apr 28, 2024
@bdytx5 bdytx5 changed the title NA Class Imbalances Apr 28, 2024

srinify commented May 9, 2024

Hi there @bdytx5 👋

Class imbalance

SDV's synthesizers were designed to model the patterns in your original data so that they can generate new, representative examples. The synthesizers will usually model the class imbalance (e.g. class A is 5% of the rows and class B is 95% of the rows) and generate synthetic examples with the same proportions.
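One way to check this on your own data is to compare the class shares in the real and synthetic label columns. Here is a minimal sketch in plain Python. The function names and the toy labels are made up for illustration and are not part of SDV.

```python
from collections import Counter


def class_proportions(labels):
    """Return each class's share of the rows as a dict of fractions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def max_proportion_gap(real_labels, synthetic_labels):
    """Largest absolute difference in class share between the two label lists."""
    real = class_proportions(real_labels)
    synth = class_proportions(synthetic_labels)
    classes = set(real) | set(synth)
    return max(abs(real.get(c, 0.0) - synth.get(c, 0.0)) for c in classes)


# Toy labels (hypothetical): class A is 5% of the real rows
# and 6% of the synthetic rows.
real_labels = ["A"] * 5 + ["B"] * 95
synthetic_labels = ["A"] * 6 + ["B"] * 94

gap = max_proportion_gap(real_labels, synthetic_labels)
print(f"Largest class share gap: {gap:.3f}")
```

A small gap suggests the imbalance was carried over. If you want a different balance in the synthetic data, that is a separate decision from whether the synthesizer reproduced the original one.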

Overfitting

I'd highly recommend trying out GaussianCopulaSynthesizer as an alternative to CTGANSynthesizer. It's very good at avoiding overfitting, is significantly faster to train, and often results in comparable model performance / quality.
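For reference, a rough sketch of fitting and sampling with GaussianCopulaSynthesizer. It assumes SDV 1.x is installed and that `real_data` is a pandas DataFrame. The wrapper function `fit_gaussian_copula` is a hypothetical helper for this example, and newer SDV releases may prefer a different metadata class than `SingleTableMetadata`.

```python
def fit_gaussian_copula(real_data, num_rows):
    """Fit a GaussianCopulaSynthesizer on a DataFrame and sample new rows.

    Assumes SDV 1.x. The imports live inside the function so the sketch
    can be loaded even where SDV is not installed.
    """
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import GaussianCopulaSynthesizer

    # Infer column types from the data; review the result before relying on it.
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_data)

    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(real_data)

    synthetic_data = synthesizer.sample(num_rows=num_rows)
    return synthetic_data, metadata
```

Usage would look like `synthetic_data, metadata = fit_gaussian_copula(real_data, num_rows=1000)`, where the row count is arbitrary.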

Even if you decide to stick with CTGANSynthesizer, the data preprocessing step that runs in the background before the model is trained also helps keep the model from overfitting.

Every time you generate synthetic data, we recommend running through our evaluation tools to inspect and understand how the model performed by comparing the synthetic data with the real data.

  • DiagnosticReport helps validate that basic criteria were maintained
  • QualityReport checks for statistical similarity between your real & synthetic data
  • Our visualizations can help you visually understand the model quality as it pertains to different columns
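The three tools above can be run together roughly like this. It is a sketch that assumes SDV 1.x, pandas DataFrames for the real and synthetic data, and a metadata object for the table. The wrapper `evaluate_synthetic` and its `column_name` argument are hypothetical names for this example.

```python
def evaluate_synthetic(real_data, synthetic_data, metadata, column_name):
    """Run the diagnostic report, the quality report, and one column plot.

    Assumes SDV 1.x. The imports live inside the function so the sketch
    can be loaded even where SDV is not installed.
    """
    from sdv.evaluation.single_table import (
        evaluate_quality,
        get_column_plot,
        run_diagnostic,
    )

    # Basic validity checks on the synthetic data.
    diagnostic_report = run_diagnostic(
        real_data=real_data,
        synthetic_data=synthetic_data,
        metadata=metadata,
    )

    # Statistical similarity between real and synthetic data.
    quality_report = evaluate_quality(
        real_data=real_data,
        synthetic_data=synthetic_data,
        metadata=metadata,
    )

    # Real vs. synthetic distribution for a single column.
    figure = get_column_plot(
        real_data=real_data,
        synthetic_data=synthetic_data,
        metadata=metadata,
        column_name=column_name,
    )
    return diagnostic_report, quality_report, figure
```

For a class imbalance question, passing the label column as `column_name` shows the real and synthetic class distributions side by side.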

We created an open source library called SDMetrics if you want to go even deeper on synthetic data quality. For these 3 features I mentioned, SDV uses this library as well.
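On the overfitting question specifically, one simple idea is to measure how many synthetic rows are exact copies of real rows. The sketch below is plain Python, not the SDMetrics implementation. The function name and toy rows are made up for illustration. SDMetrics has a metric along these lines (NewRowSynthesis, as far as I know), which is the better choice for real use.

```python
def copied_row_fraction(real_rows, synthetic_rows):
    """Fraction of synthetic rows that exactly match some real row."""
    if not synthetic_rows:
        return 0.0
    real_set = {tuple(row) for row in real_rows}
    copied = sum(1 for row in synthetic_rows if tuple(row) in real_set)
    return copied / len(synthetic_rows)


# Toy rows (hypothetical): one of the four synthetic rows repeats a real row.
real_rows = [(25, "A"), (40, "B"), (31, "B")]
synthetic_rows = [(25, "A"), (38, "B"), (29, "B"), (52, "A")]

fraction = copied_row_fraction(real_rows, synthetic_rows)
print(f"Copied rows: {fraction:.0%}")
```

A high fraction can be a sign that the model memorized training rows, although with few columns or low-cardinality data some exact matches are expected by chance.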

@srinify srinify added under discussion Issue is currently being discussed and removed bug Something isn't working new Automatic label applied to new issues labels May 9, 2024