Does class imbalance need to be handled in the data before synthesizing new data with synthesizers like CTGAN? Also, is there a way to validate the model to ensure that it isn't overfitting?
SDV's synthesizers were designed to model the patterns in your original data so that they can generate new, representative examples. The synthesizers will usually model the class imbalance (e.g. class A is 5% of the rows and class B is 95% of the rows) and generate synthetic examples with the same proportions.
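If you want to confirm that the proportions carried over, a quick check is to compare the class shares in the real and synthetic label columns. Here is a minimal, standard-library sketch of that check. The helper names (`class_proportions`, `max_proportion_gap`) and the toy labels are made up for illustration and are not part of SDV.

```python
from collections import Counter


def class_proportions(labels):
    """Return each class's share of the rows as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def max_proportion_gap(real_labels, synthetic_labels):
    """Largest absolute difference in class share between real and synthetic labels."""
    real = class_proportions(real_labels)
    synthetic = class_proportions(synthetic_labels)
    # A class missing from one side counts as a 0% share on that side.
    classes = set(real) | set(synthetic)
    return max(abs(real.get(c, 0.0) - synthetic.get(c, 0.0)) for c in classes)


# Toy labels standing in for a real target column and its synthetic counterpart.
real_labels = ["A"] * 5 + ["B"] * 95
synthetic_labels = ["A"] * 6 + ["B"] * 94

gap = max_proportion_gap(real_labels, synthetic_labels)
print(f"Largest class-share gap: {gap:.2%}")
```

With a pandas DataFrame you could pass `real_data["label"]` and `synthetic_data["label"]` (hypothetical column name) to the same helpers. A small gap suggests the imbalance was reproduced. If you actually want balanced synthetic data, that is a separate decision from whether the synthesizer modeled the original proportions.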
Overfitting
I'd highly recommend trying out GaussianCopulaSynthesizer as an alternative to CTGANSynthesizer. It's very good at avoiding overfitting, is significantly faster to train, and often results in comparable model performance / quality.
Even if you decide to stick with CTGANSynthesizer, the data preprocessing step that runs in the background before the model is trained also helps keep the model from overfitting.
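Because both synthesizers follow the same fit/sample workflow, trying one in place of the other is a small change. Here is a rough sketch, assuming the SDV 1.x single-table API (`SingleTableMetadata`, `GaussianCopulaSynthesizer`, `CTGANSynthesizer`). The helper `fit_and_sample` and the file name are hypothetical, and the SDV-specific part is left in comments so the snippet runs even without SDV installed.

```python
def fit_and_sample(synthesizer_cls, metadata, real_data, num_rows):
    """Fit the given synthesizer class on real_data and return num_rows synthetic rows.

    synthesizer_cls is expected to behave like an SDV single-table synthesizer:
    constructed from metadata, with .fit(data) and .sample(num_rows=...).
    """
    synthesizer = synthesizer_cls(metadata)
    synthesizer.fit(real_data)
    return synthesizer.sample(num_rows=num_rows)


# Example usage (assumes `pip install sdv`, SDV 1.x):
#
# import pandas as pd
# from sdv.metadata import SingleTableMetadata
# from sdv.single_table import CTGANSynthesizer, GaussianCopulaSynthesizer
#
# real_data = pd.read_csv("my_data.csv")  # hypothetical file
# metadata = SingleTableMetadata()
# metadata.detect_from_dataframe(real_data)
#
# # Same call, different synthesizer class.
# copula_data = fit_and_sample(GaussianCopulaSynthesizer, metadata, real_data, len(real_data))
# ctgan_data = fit_and_sample(CTGANSynthesizer, metadata, real_data, len(real_data))
```

Newer SDV releases may prefer a different metadata class, so check the docs for the version you have installed.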
Every time you generate synthetic data, we recommend running through our evaluation tools to inspect and understand how the model performed by comparing the synthetic data with the real data.
DiagnosticReport helps validate that basic criteria were maintained
QualityReport checks for statistical similarity between your real & synthetic data
Our visualizations can help you visually understand the model quality as it pertains to different columns
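The three tools above can be run together in a few lines. This is a sketch assuming the SDV 1.x functions in `sdv.evaluation.single_table` (`run_diagnostic`, `evaluate_quality`, `get_column_plot`). The wrapper `evaluate_synthetic` is a hypothetical helper, and the import sits inside it so that defining the function does not require SDV.

```python
def evaluate_synthetic(real_data, synthetic_data, metadata, column_name):
    """Run the diagnostic report, the quality report, and one column plot.

    Assumes real_data and synthetic_data are pandas DataFrames and metadata
    is the same SDV metadata object that was used to fit the synthesizer.
    """
    # Imported lazily so this module loads even where SDV is not installed.
    from sdv.evaluation.single_table import (
        evaluate_quality,
        get_column_plot,
        run_diagnostic,
    )

    # Basic validity checks on the synthetic data.
    diagnostic_report = run_diagnostic(real_data, synthetic_data, metadata)

    # Statistical similarity between real and synthetic data.
    quality_report = evaluate_quality(real_data, synthetic_data, metadata)

    # Visual comparison of a single column's real vs. synthetic distribution.
    figure = get_column_plot(
        real_data=real_data,
        synthetic_data=synthetic_data,
        metadata=metadata,
        column_name=column_name,
    )
    return diagnostic_report, quality_report, figure


# Example usage (assumes SDV 1.x and a hypothetical "label" column):
# diagnostic, quality, fig = evaluate_synthetic(real_data, synthetic_data, metadata, "label")
# print(quality.get_score())
# fig.show()
```

Plotting the target column is a convenient way to see at a glance whether the class imbalance was reproduced.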
We created an open source library called SDMetrics if you want to go even deeper on synthetic data quality. The three features mentioned above are built on this library as well.
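One overfitting signal worth checking directly is whether the synthesizer is simply copying real rows. SDMetrics provides a metric in this spirit (`NewRowSynthesis`, to the best of my knowledge, so please confirm the name and arguments in the SDMetrics docs). The snippet below is not that implementation. It is a simplified, standard-library sketch of the idea using exact row matches only, and `new_row_fraction` and the toy rows are hypothetical.

```python
def new_row_fraction(real_rows, synthetic_rows):
    """Fraction of synthetic rows that do not exactly match any real row.

    A value near 1.0 means almost every synthetic row is new; a value near 0.0
    means the synthetic data largely duplicates the real data.
    """
    if not synthetic_rows:
        raise ValueError("synthetic_rows must not be empty")
    real_set = {tuple(row) for row in real_rows}
    new_count = sum(1 for row in synthetic_rows if tuple(row) not in real_set)
    return new_count / len(synthetic_rows)


# Toy rows of (age, class) standing in for real and synthetic tables.
real_rows = [(25, "A"), (31, "B"), (47, "B")]
synthetic_rows = [(25, "A"), (29, "B"), (52, "B"), (40, "A")]

score = new_row_fraction(real_rows, synthetic_rows)
print(f"New-row fraction: {score:.2f}")
```

Exact matching is strict for numerical columns, where a near-copy differs by a tiny amount, so treat this as a first pass and not a full overfitting test.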