v1.0.0 - 2023-03-28
This is a major release that introduces a new API to the SDV aimed at streamlining the process of synthetic data generation! To achieve this, this release includes the addition of several large features.
Metadata
Some of the most notable additions are the new SingleTableMetadata and MultiTableMetadata classes. These classes enable a number of features that make it easier to synthesize your data correctly, such as:
- Automatic data detection - Calling metadata.detect_from_dataframe() or metadata.detect_from_csv() will populate the metadata automatically with values it thinks represent the data.
- Easy updating - Once an instance of the metadata is created, values can be easily updated using a number of methods defined in the API. For more info, view the docs.
- Metadata validation - Calling metadata.validate() will return a report of any invalid definitions in the metadata specification.
- Upgrading - Users with the previous metadata format can easily update to the new specification using the upgrade_metadata() method.
- Saving and loading - The metadata itself can easily be saved to a JSON file and loaded back up later.
Class and Module Names
Another major change is the renaming of our core modeling classes and modules. The name changes are meant to highlight the difference between the underlying machine learning models, and the objects responsible for the end-to-end workflow of generating synthetic data. The main name changes are as follows:
- tabular -> single_table
- relational -> multi_table
- timeseries -> sequential
- BaseTabularModel -> BaseSingleTableSynthesizer
- GaussianCopula -> GaussianCopulaSynthesizer
- CTGAN -> CTGANSynthesizer
- TVAE -> TVAESynthesizer
- CopulaGAN -> CopulaGANSynthesizer
- PAR -> PARSynthesizer
- HMA1 -> HMASynthesizer
In SDV 1.0, synthesizers are classes that take in metadata and handle data preprocessing, model training and model sampling. This is similar to the previous BaseTabularModel in SDV <1.0.
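For anyone updating older scripts, the renames listed above can be written down as a lookup table. The snippet below is only a sketch: the helper new_import_line is hypothetical and not part of SDV, and it assumes the renamed classes are importable directly from the renamed modules (for example, GaussianCopulaSynthesizer from sdv.single_table).

```python
# Old -> new names, following the rename list above.
MODULE_RENAMES = {
    "tabular": "single_table",
    "relational": "multi_table",
    "timeseries": "sequential",
}

CLASS_RENAMES = {
    "BaseTabularModel": "BaseSingleTableSynthesizer",
    "GaussianCopula": "GaussianCopulaSynthesizer",
    "CTGAN": "CTGANSynthesizer",
    "TVAE": "TVAESynthesizer",
    "CopulaGAN": "CopulaGANSynthesizer",
    "PAR": "PARSynthesizer",
    "HMA1": "HMASynthesizer",
}


def new_import_line(old_module, old_class):
    """Build the SDV 1.0 import statement for a pre-1.0 module/class pair.

    Hypothetical helper for illustration only; it is not part of SDV.
    Raises KeyError for names that are not in the rename tables.
    """
    module = MODULE_RENAMES[old_module]
    name = CLASS_RENAMES[old_class]
    return f"from sdv.{module} import {name}"


print(new_import_line("tabular", "GaussianCopula"))
# from sdv.single_table import GaussianCopulaSynthesizer
```

A plain dictionary is used so that an unknown legacy name fails loudly with a KeyError rather than producing a guessed import.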
Synthetic Data Workflow
Synthesizers in SDV 1.0 define a clear workflow for generating synthetic data.
- Synthesizers are initialized with a metadata class.
- They can then be used to transform the data and apply constraints using the synthesizer.preprocess() method. This step also validates that the data matches the provided metadata to avoid errors in fitting or sampling.
- The processed data can then be fed into the underlying machine learning model using synthesizer.fit_processed_data(). (Alternatively, data can be preprocessed and fit to the model using synthesizer.fit().)
- Data can then be sampled using synthesizer.sample().
Each synthesizer class also provides a series of methods to help users customize the transformations their data goes through. Read more about that here.
Notice that the preprocessing and model fitting steps can now be separated. This can be helpful if preprocessing is time consuming or if the data has been processed externally.
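The separated steps described above can be sketched as a small helper that accepts any object exposing preprocess(), fit_processed_data() and sample(). The names run_workflow and RecordingSynthesizer are hypothetical; the stand-in class is not part of SDV and exists only so the sketch runs without SDV installed. The num_rows keyword on sample() is an assumption about the synthesizer's signature.

```python
class RecordingSynthesizer:
    """Stand-in (not part of SDV) that records which methods were called."""

    def __init__(self):
        self.calls = []
        self._rows = []

    def preprocess(self, data):
        self.calls.append("preprocess")
        # A real synthesizer would transform the data and apply constraints here.
        return list(data)

    def fit_processed_data(self, processed_data):
        self.calls.append("fit_processed_data")
        self._rows = processed_data

    def sample(self, num_rows):
        self.calls.append("sample")
        # Cycle through the stored rows; a real synthesizer samples from a fitted model.
        return [self._rows[i % len(self._rows)] for i in range(num_rows)]


def run_workflow(synthesizer, real_data, num_rows):
    """Preprocess, fit, and sample as three separate steps."""
    processed = synthesizer.preprocess(real_data)
    synthesizer.fit_processed_data(processed)
    return synthesizer.sample(num_rows=num_rows)


stand_in = RecordingSynthesizer()
rows = run_workflow(stand_in, [{"age": 31}, {"age": 47}], num_rows=3)
print(stand_in.calls)  # ['preprocess', 'fit_processed_data', 'sample']
print(len(rows))       # 3
```

With SDV installed, the same helper would presumably be called with a real synthesizer, such as a GaussianCopulaSynthesizer built from a metadata object, and a pandas DataFrame as real_data. Keeping the processed data in its own variable is what allows it to be cached or produced externally before fitting.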
Other Highly Requested Features
Another major addition is control over randomization. In SDV <1.0, users could set a seed to control the randomization for only some columns. In SDV 1.0, randomization is controlled for all columns. Every new call to sample generates new data, but the synthesizer's seed can be reset to the original state using synthesizer.reset_randomization(), enabling reproducibility.
SDV 1.0 adds accessibility and transparency into the transformers used for preprocessing and underlying machine learning models.
- Using the synthesizer.get_transformers() method, you can access the transformers used to preprocess each column and view their properties. This can be useful for debugging and accessing privacy information like mappings used to mask data.
- Distribution parameters learned by copula models can be accessed using the synthesizer.get_learned_distributions() method.
PII handling is improved by the following features:
- Primary keys can be set to natural sdtypes (e.g. SSN, email, name). Previously they could only be numerical or text.
- The PseudoAnonymizedFaker can be used to provide consistent mapping to PII columns. As mentioned before, the mapping itself can be accessed by viewing the transformers for the column using synthesizer.get_transformers().
- A bug causing PII columns to slow down modeling is patched.
Finally, the synthetic data can now be easily evaluated using the evaluate_quality() and run_diagnostic() methods. The data can be compared visually to the actual data using the get_column_plot() and get_column_pair_plot() methods. For more info on how to visualize or interpret the synthetic data evaluation, read the docs here.
Issues Resolved
New Features
- Change auto_assign_transformers to handle id types - Issue #1325 by @pvk-developer
- Change 'text' sdtype to 'id' - Issue #1324 by @frances-h
- In upgrade_metadata, return the object instead of writing it to a JSON file - Issue #1319 by @frances-h
- In upgrade_metadata index primary keys should be converted to text - Issue #1318 by @amontanez24
- Add load_from_dict to SingleTableMetadata and MultiTableMetadata - Issue #1314 by @amontanez24
- Throw a SynthesizerInputError if FixedCombinations constraint is applied to a column that is not boolean or categorical - Issue #1306 by @frances-h
- Missing save and load methods for HMASynthesizer - Issue #1262 by @amontanez24
- Better input validation when creating single and multi table synthesizers - Issue #1242 by @fealho
- Better input validation on HMASynthesizer.sample - Issue #1241 by @R-Palazzo
- Validate that relationship must be between a primary key and foreign key - Issue #1236 by @fealho
- Improve update_column validation for pii attribute - Issue #1226 by @pvk-developer
- Order the output of get_transformers() based on the metadata - Issue #1222 by @pvk-developer
- Log if any numerical_distributions will not be applied - Issue #1212 by @fealho
- Improve error handling for GaussianCopulaSynthesizer: numerical_distributions - Issue #1211 by @fealho
- Improve error handling when validating constraints - Issue #1210 by @fealho
- Add fake_companies demo - Issue #1209 by @amontanez24
- Allow me to create a custom constraint class and use it in the same file - Issue #1205 by @amontanez24
- Sampling should reset after retraining the model - Issue #1201 by @pvk-developer
- Change function name HMASynthesizer.update_table_parameters --> set_table_parameters - Issue #1200 by @pvk-developer
- Add get_info method to synthesizers - Issue #1199 by @fealho
- Add evaluation methods to synthesizer - Issue #1190 by @fealho
- Update evaluate.py to work with the new metadata - Issue #1186 by @fealho
- Remove old code - Issue #1181 by @pvk-developer
- Drop support for python 3.6 and add support for 3.10 - Issue #1176 by @fealho
- Add constraint methods to MultiTableSynthesizers - Issue #1171 by @fealho
- Update custom constraint workflow - Issue #1169 by @pvk-developer
- Add get_constraints method to synthesizers - Issue #1168 by @pvk-developer
- Migrate adding and validating constraints to BaseSynthesizer - Issue #1163 by @pvk-developer
- Change metadata "SCHEMA_VERSION" --> "METADATA_SPEC_VERSION" - Issue #1139 by @amontanez24
- Add ability to reset random sampling - Issue #1130 by @pvk-developer
- Add get_available_demos - Issue #1129 by @fealho
- Add demo loading functionality - Issue #1128 by @fealho
- Use logging instead of printing in detect methods - Issue #1107 by @fealho
- Add save and load methods to synthesizers - Issue #1106 by @pvk-developer
- Add sampling methods to PARSynthesizer - Issue #1083 by @amontanez24
- Add transformer methods to PARSynthesizer - Issue #1082 by @fealho
- Add validate to PARSynthesizer - Issue #1081 by @amontanez24
- Add preprocess and fit methods to PARSynthesizer - Issue #1080 by @amontanez24
- Create SingleTablePreset - Issue #1079 by @amontanez24
- Add sample method to multi-table synthesizers - Issue #1078 by @pvk-developer
- Add get_learned_distributions method to synthesizers - Issue #1075 by @pvk-developer
- Add preprocess and fit methods to multi-table synthesizers - Issue #1074 by @pvk-developer
- Add transformer related methods to BaseMultiTableSynthesizer - Issue #1072 by @fealho
- Add validate method to BaseMultiTableSynthesizer - Issue #1071 by @pvk-developer
- Create BaseMultiTableSynthesizer and HMASynthesizer classes - Issue #1070 by @pvk-developer
- Create PARSynthesizer - Issue #1055 by @amontanez24
- Raise an error if an invalid sdtype is provided to the metadata - Issue #1042 by @amontanez24
- Only allow datetime and numerical sdtypes to be set as the sequence index - Issue #1030 by @amontanez24
- Change set_alternate_keys to add_alternate_keys and add error handling - Issue #1029 by @amontanez24
- Create MultiTableMetadata.add_table method - Issue #1024 by @amontanez24
- Add update_transformers to synthesizers - Issue #1021 by @fealho
- Add assign_transformers and get_transformers methods to synthesizers - Issue #1020 by @pvk-developer
- Add fit and fit_processed_data methods to synthesizers - Issue #1019 by @pvk-developer
- Add preprocess method to synthesizers - Issue #1018 by @pvk-developer
- Add sampling to synthesizer classes - Issue #1015 by @pvk-developer
- Add validate method to synthesizer - Issue #1014 by @fealho
- Create GaussianCopula, CTGAN, TVAE and CopulaGAN synthesizer classes - Issue #1013 by @pvk-developer
- Create BaseSynthesizer class - Issue #1012 by @pvk-developer
- Add constraint conversion to upgrade_metadata - Issue #1005 by @amontanez24
- Add method to generate keys to DataProcessor - Issue #994 by @pvk-developer
- Create formatter - Issue #970 by @fealho
- Create a utility to load multiple CSV files at once - Issue #969 by @amontanez24
- Create a utility to convert old --> new metadata format - Issue #966 by @amontanez24
- Add validation check that primary_key, alternate_keys and sequence_key cannot be sdtype categorical - Issue #963 by @fealho
- Add anonymization to DataProcessor - Issue #950 by @pvk-developer
- Add utility methods to DataProcessor - Issue #948 by @fealho
- Add fit, transform and reverse_transform to DataProcessor - Issue #947 by @amontanez24
- Create DataProcessor class - Issue #946 by @amontanez24
- Add add_constraint method to MultiTableMetadata - Issue #895 by @amontanez24
- Add key related methods to MultiTableMetadata - Issue #894 by @fealho
- Add update_column and add_column methods to MultiTableMetadata - Issue #893 by @amontanez24
- Add detect methods to MultiTableMetadata - Issue #892 by @amontanez24
- Add load_from_json and save_to_json methods to the MultiTableMetadata - Issue #891 by @fealho
- Add add_relationship method to MultiTableMetadata - Issue #890 by @pvk-developer
- Add validate method to MultiTableMetadata - Issue #888 by @pvk-developer
- Add visualize method to MultiTableMetadata class - Issue #884 by @amontanez24
- Create MultiTableMetadata class - Issue #883 by @pvk-developer
- Add add_constraint method to SingleTableMetadata - Issue #881 by @amontanez24
- Add key related methods to SingleTableMetadata - Issue #880 by @fealho
- Add validate method to SingleTableMetadata - Issue #879 by @fealho
- Add _validate_inputs class method to each constraint - Issue #878 by @fealho
- Add update_column and add_column methods to SingleTableMetadata - Issue #877 by @pvk-developer
- Add detect methods to SingleTableMetadata - Issue #876 by @pvk-developer
- Add load_from_json and save_to_json methods to SingleTableMetadata - Issue #874 by @pvk-developer
- Create SingleTableMetadata class - Issue #873 by @pvk-developer
Bugs Fixed
- In upgrade_metadata, PII values are being converted to generic categorical columns - Issue #1317 by @frances-h
- PARSynthesizer is missing save and load methods - Issue #1289 by @amontanez24
- Confusing warning when updating transformers - Issue #1272 by @frances-h
- When adding constraints, auto_assign_transformers is showing columns that should no longer exist - Issue #1260 by @pvk-developer
- Cannot fit twice if I modify transformers: ValueError: There are non-numerical values in your data. - Issue #1259 by @frances-h
- Cannot fit twice if I add constraints: ValueError: There are non-numerical values in your data. - Issue #1258 by @frances-h
- HMASynthesizer errors out when fitting a dataset that has a table which holds primary key and foreign keys only - Issue #1257 by @pvk-developer
- Change ValueErrors to InvalidMetadataErrors - Issue #1251 by @frances-h
- Multi-table should show foreign key transformers as None - Issue #1249 by @frances-h
- Cannot use HMASynthesizer.fit_processed_data more than once (KeyError) - Issue #1240 by @frances-h
- Function get_available_demos crashes if a dataset's num-tables or size-MB cannot be found - Issue #1215 by @amontanez24
- Cannot supply a natural key to HMASynthesizer (where sdtype is custom): Error in sample - Issue #1214 by @pvk-developer
- Unable to sample when using a PseudoAnonymizedFaker - Issue #1207 by @pvk-developer
- Incorrect sdtype specified in demo dataset student_placements_pii - Issue #1206 by @amontanez24
- Auto assigned transformers for datetime columns don't have the right parameters - Issue #1204 by @pvk-developer
- Cannot apply Inequality constraint on demo dataset's datetime columns - Issue #1203 by @pvk-developer
- pii should not be required to auto-assign faker transformers - Issue #1194 by @pvk-developer
- Misc. bug fixes for SDV 1.0.0 - Issue #1193 by @pvk-developer
- Small bug fixes in demo module - Issue #1192 by @pvk-developer
- Foreign Keys are added as Alternate Keys when upgrading - Issue #1143 by @pvk-developer
- Alternate keys not unique when assigned to a semantic type - Issue #1111 by @pvk-developer
- Synthesizer errors if column is semantic type and pii is False - Issue #1110 by @fealho
- Sampled values not unique if primary key is numerical - Issue #1109 by @pvk-developer
- Validate not called during synthesizer creation - Issue #1105 by @pvk-developer
- SingleTableSynthesizer fit doesn't update rounding - Issue #1104 by @amontanez24
- Method auto_assign_transformers always sets enforce_min_max_values=True - Issue #1095 by @fealho
- Sampled context columns in PAR must be in the same order - Issue #1052 by @amontanez24
- Incorrect schema version printing during detect_table_from_dataframe - Issue #1038 by @amontanez24
- Same relationship can be added twice to MultiTableMetadata - Issue #1031 by @amontanez24
- Miscellaneous metadata bugs - Issue #1026 by @amontanez24
Maintenance
- SDV Package Maintenance Updates - Issue #1140 by @amontanez24
Internal
- Add integration tests for 'Synthesize Sequences' demo - Issue #1295 by @pvk-developer
- Add integration tests for 'Adding Constraints' demo - Issue #1280 by @pvk-developer
- Add integration tests to the 'Use Your Own Data' demo - Issue #1278 by @frances-h
- Add integration tests for 'Synthesize Multi Tables' demo - Issue #1277 by @pvk-developer
- Add integration tests for 'Synthesize a Table' demo - Issue #1276 by @frances-h
- Update get_available_demos tests - Issue #1247 by @fealho
- Make private attributes public in the metadata - Issue #1245 by @fealho