Release v1.0.0 - 2023-03-28 · sdv-dev/SDV

This is a major release that introduces a new SDV API aimed at streamlining synthetic data generation! To achieve this, the release adds several large features.

Metadata

Some of the most notable additions are the new SingleTableMetadata and MultiTableMetadata classes. These classes enable a number of features that make it easier to synthesize your data correctly such as:

Automatic data detection - Calling metadata.detect_from_dataframe() or metadata.detect_from_csv() automatically populates the metadata with values inferred from the data.
Easy updating - Once an instance of the metadata is created, values can be easily updated using a number of methods defined in the API. For more info, view the docs.
Metadata validation - Calling metadata.validate() will return a report of any invalid definitions in the metadata specification.
Upgrading - Users with the previous metadata format can easily update to the new specification using the upgrade_metadata() method.
Saving and loading - The metadata itself can easily be saved to a JSON file and loaded back later.
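
The metadata features above can be chained into one small helper. This is a minimal sketch, not the canonical workflow: the helper name `prepare_metadata` and its arguments are hypothetical, and it assumes `metadata` is an empty `SingleTableMetadata` instance, `data` is a pandas DataFrame, and the primary key column should use the `id` sdtype.

```python
def prepare_metadata(metadata, data, primary_key, json_path):
    """Detect, adjust, validate, save, and reload single-table metadata.

    `metadata` is assumed to be an empty SingleTableMetadata instance and
    `data` a pandas DataFrame. The helper itself is illustrative only.
    """
    # Automatic detection: infer column sdtypes from the data.
    metadata.detect_from_dataframe(data=data)

    # Easy updating: correct what detection cannot know (assumed here: the key is an id).
    metadata.update_column(column_name=primary_key, sdtype="id")
    metadata.set_primary_key(column_name=primary_key)

    # Validation: check the specification before it is used by a synthesizer.
    metadata.validate()

    # Saving and loading: persist to JSON, then read it back.
    metadata.save_to_json(filepath=json_path)
    return type(metadata).load_from_json(filepath=json_path)
```

With SDV 1.0 installed, a call might look like `prepare_metadata(SingleTableMetadata(), df, "user_id", "metadata.json")`, where `SingleTableMetadata` is imported from `sdv.metadata` and `user_id` is a made-up column name.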

Class and Module Names

Another major change is the renaming of our core modeling classes and modules. The new names highlight the difference between the underlying machine learning models and the objects responsible for the end-to-end workflow of generating synthetic data. The main name changes are as follows:

tabular -> single_table
relational -> multi_table
timeseries -> sequential
BaseTabularModel -> BaseSingleTableSynthesizer
GaussianCopula -> GaussianCopulaSynthesizer
CTGAN -> CTGANSynthesizer
TVAE -> TVAESynthesizer
CopulaGAN -> CopulaGANSynthesizer
PAR -> PARSynthesizer
HMA1 -> HMASynthesizer
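
When migrating older code, the renames listed above can be kept in a lookup table. The sketch below only restates that list; the dictionary names and the `new_name` helper are hypothetical and not part of SDV.

```python
# Old (SDV <1.0) names mapped to their SDV 1.0 replacements, copied from the list above.
MODULE_RENAMES = {
    "tabular": "single_table",
    "relational": "multi_table",
    "timeseries": "sequential",
}
CLASS_RENAMES = {
    "BaseTabularModel": "BaseSingleTableSynthesizer",
    "GaussianCopula": "GaussianCopulaSynthesizer",
    "CTGAN": "CTGANSynthesizer",
    "TVAE": "TVAESynthesizer",
    "CopulaGAN": "CopulaGANSynthesizer",
    "PAR": "PARSynthesizer",
    "HMA1": "HMASynthesizer",
}


def new_name(old_name):
    """Return the SDV 1.0 name for an old module or class name.

    Names that were not renamed are returned unchanged.
    """
    renames = {**MODULE_RENAMES, **CLASS_RENAMES}
    return renames.get(old_name, old_name)


print(new_name("GaussianCopula"))  # GaussianCopulaSynthesizer
```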

In SDV 1.0, synthesizers are classes that take in metadata and handle data preprocessing, model training and model sampling. This is similar to the previous BaseTabularModel in SDV <1.0.

Synthetic Data Workflow

Synthesizers in SDV 1.0 define a clear workflow for generating synthetic data.

Synthesizers are initialized with a metadata class.
They can then be used to transform the data and apply constraints using the synthesizer.preprocess() method. This step also validates that the data matches the provided metadata to avoid errors in fitting or sampling.
The processed data can then be fed into the underlying machine learning model using synthesizer.fit_processed_data(). (Alternatively, data can be preprocessed and fit to the model using synthesizer.fit().)
Data can then be sampled using synthesizer.sample().
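
The steps above can be sketched as a single function. This is an illustration under assumptions: `run_workflow` and its `split_steps` flag are hypothetical, and `synthesizer` is assumed to be any SDV 1.0 single-table synthesizer that was already initialized with metadata.

```python
def run_workflow(synthesizer, data, num_rows=100, split_steps=True):
    """Preprocess, fit, and sample with an already-initialized synthesizer.

    `data` is assumed to be a pandas DataFrame matching the synthesizer's metadata.
    """
    if split_steps:
        # Transform the data, apply constraints, and validate it against the metadata.
        processed = synthesizer.preprocess(data)
        # Train the underlying machine learning model on the processed data.
        synthesizer.fit_processed_data(processed)
    else:
        # Alternative: preprocess and fit in one call.
        synthesizer.fit(data)

    # Generate synthetic rows.
    return synthesizer.sample(num_rows=num_rows)
```

With SDV 1.0 installed, this could be called as `run_workflow(GaussianCopulaSynthesizer(metadata), data)`, with `GaussianCopulaSynthesizer` imported from `sdv.single_table`. Keeping `processed` around is what makes it possible to skip the preprocessing step on a later fit.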

Each synthesizer class also provides a series of methods to help users customize the transformations their data goes through. Read more about that here.

Notice that the preprocessing and model fitting steps can now be separated. This can be helpful if preprocessing is time-consuming or if the data has been processed externally.

Other Highly Requested Features

Another major addition is control over randomization. In SDV <1.0, users could set a seed to control the randomization for only some columns. In SDV 1.0, randomization is controlled for all columns. Every new call to sample generates new data, but the synthesizer's seed can be reset to the original state using synthesizer.reset_sampling(), enabling reproducibility.

SDV 1.0 also makes the transformers used for preprocessing and the underlying machine learning models more accessible and transparent.

Using the synthesizer.get_transformers() method, you can access the transformers used to preprocess each column and view their properties. This can be useful for debugging and accessing privacy information like mappings used to mask data.
Distribution parameters learned by copula models can be accessed using the synthesizer.get_learned_distributions() method.
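
As a rough illustration of these two accessors, the sketch below collects their output into one summary. The `summarize_synthesizer` helper is hypothetical, and it assumes a copula-based synthesizer (for example GaussianCopulaSynthesizer) that has already been fitted, and that `get_transformers()` returns a mapping from column name to transformer.

```python
def summarize_synthesizer(synthesizer):
    """Summarize the transformers and learned distributions of a fitted synthesizer.

    Assumes a copula-based synthesizer, since only those expose
    get_learned_distributions().
    """
    # Map each column to the class name of its transformer (None if it has no transformer).
    transformers = {
        column: None if transformer is None else type(transformer).__name__
        for column, transformer in synthesizer.get_transformers().items()
    }
    # Distribution parameters learned by the copula model.
    distributions = synthesizer.get_learned_distributions()
    return {"transformers": transformers, "distributions": distributions}
```

To inspect a masking map or other transformer property, look at the transformer object itself rather than this name-only summary.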

PII handling is improved by the following features:

Primary keys can be set to natural sdtypes (e.g., SSN, email, name). Previously they could only be numerical or text.
The PseudoAnonymizedFaker can be used to provide a consistent mapping for PII columns. As mentioned before, the mapping itself can be accessed by viewing the transformers for the column using synthesizer.get_transformers().
A bug causing PII columns to slow down modeling is patched.

Finally, the synthetic data can now be easily evaluated using the evaluate_quality() and run_diagnostic() methods. The data can be compared visually to the actual data using the get_column_plot() and get_column_pair_plot() methods. For more info on how to visualize or interpret the synthetic data evaluation, read the docs here.
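
A minimal sketch of the evaluation calls follows. It assumes the functions live in the `sdv.evaluation.single_table` module and accept the keyword arguments shown; the wrapper `evaluate_synthetic_data` is hypothetical, and `real_data` and `synthetic_data` are assumed to be pandas DataFrames described by `metadata`.

```python
def evaluate_synthetic_data(real_data, synthetic_data, metadata, column_name):
    """Score synthetic data against real data and plot one column for comparison."""
    # Imported here so the sketch can be loaded even where SDV is not installed.
    from sdv.evaluation.single_table import (
        evaluate_quality,
        get_column_plot,
        run_diagnostic,
    )

    # How closely the synthetic data follows the statistical patterns of the real data.
    quality_report = evaluate_quality(
        real_data=real_data, synthetic_data=synthetic_data, metadata=metadata
    )
    # Basic checks on the synthetic data.
    diagnostic_report = run_diagnostic(
        real_data=real_data, synthetic_data=synthetic_data, metadata=metadata
    )
    # Visual comparison of a single column in the real and synthetic data.
    figure = get_column_plot(
        real_data=real_data,
        synthetic_data=synthetic_data,
        column_name=column_name,
        metadata=metadata,
    )
    return quality_report, diagnostic_report, figure
```

get_column_pair_plot() can be used the same way to compare two columns at once.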

Issues Resolved

New Features

Change auto_assign_transformers to handle id types - Issue #1325 by @pvk-developer
Change 'text' sdtype to 'id' - Issue #1324 by @frances-h
In upgrade_metadata, return the object instead of writing it to a JSON file - Issue #1319 by @frances-h
In upgrade_metadata index primary keys should be converted to text - Issue #1318 by @amontanez24
Add load_from_dict to SingleTableMetadata and MultiTableMetadata - Issue #1314 by @amontanez24
Throw a SynthesizerInputError if FixedCombinations constraint is applied to a column that is not boolean or categorical - Issue #1306 by @frances-h
Missing save and load methods for HMASynthesizer - Issue #1262 by @amontanez24
Better input validation when creating single and multi table synthesizers - Issue #1242 by @fealho
Better input validation on HMASynthesizer.sample - Issue #1241 by @R-Palazzo
Validate that relationship must be between a primary key and foreign key - Issue #1236 by @fealho
Improve update_column validation for pii attribute - Issue #1226 by @pvk-developer
Order the output of get_transformers() based on the metadata - Issue #1222 by @pvk-developer
Log if any numerical_distributions will not be applied - Issue #1212 by @fealho
Improve error handling for GaussianCopulaSynthesizer: numerical_distributions - Issue #1211 by @fealho
Improve error handling when validating constraints - Issue #1210 by @fealho
Add fake_companies demo - Issue #1209 by @amontanez24
Allow me to create a custom constraint class and use it in the same file - Issue #1205 by @amontanez24
Sampling should reset after retraining the model - Issue #1201 by @pvk-developer
Change function name HMASynthesizer.update_table_parameters --> set_table_parameters - Issue #1200 by @pvk-developer
Add get_info method to synthesizers - Issue #1199 by @fealho
Add evaluation methods to synthesizer - Issue #1190 by @fealho
Update evaluate.py to work with the new metadata - Issue #1186 by @fealho
Remove old code - Issue #1181 by @pvk-developer
Drop support for python 3.6 and add support for 3.10 - Issue #1176 by @fealho
Add constraint methods to MultiTableSynthesizers - Issue #1171 by @fealho
Update custom constraint workflow - Issue #1169 by @pvk-developer
Add get_constraints method to synthesizers - Issue #1168 by @pvk-developer
Migrate adding and validating constraints to BaseSynthesizer - Issue #1163 by @pvk-developer
Change metadata "SCHEMA_VERSION" --> "METADATA_SPEC_VERSION" - Issue #1139 by @amontanez24
Add ability to reset random sampling - Issue #1130 by @pvk-developer
Add get_available_demos - Issue #1129 by @fealho
Add demo loading functionality - Issue #1128 by @fealho
Use logging instead of printing in detect methods - Issue #1107 by @fealho
Add save and load methods to synthesizers - Issue #1106 by @pvk-developer
Add sampling methods to PARSynthesizer - Issue #1083 by @amontanez24
Add transformer methods to PARSynthesizer - Issue #1082 by @fealho
Add validate to PARSynthesizer - Issue #1081 by @amontanez24
Add preprocess and fit methods to PARSynthesizer - Issue #1080 by @amontanez24
Create SingleTablePreset - Issue #1079 by @amontanez24
Add sample method to multi-table synthesizers - Issue #1078 by @pvk-developer
Add get_learned_distributions method to synthesizers - Issue #1075 by @pvk-developer
Add preprocess and fit methods to multi-table synthesizers - Issue #1074 by @pvk-developer
Add transformer related methods to BaseMultiTableSynthesizer - Issue #1072 by @fealho
Add validate method to BaseMultiTableSynthesizer - Issue #1071 by @pvk-developer
Create BaseMultiTableSynthesizer and HMASynthesizer classes - Issue #1070 by @pvk-developer
Create PARSynthesizer - Issue #1055 by @amontanez24
Raise an error if an invalid sdtype is provided to the metadata - Issue #1042 by @amontanez24
Only allow datetime and numerical sdtypes to be set as the sequence index - Issue #1030 by @amontanez24
Change set_alternate_keys to add_alternate_keys and add error handling - Issue #1029 by @amontanez24
Create MultiTableMetadata.add_table method - Issue #1024 by @amontanez24
Add update_transformers to synthesizers - Issue #1021 by @fealho
Add assign_transformers and get_transformers methods to synthesizers - Issue #1020 by @pvk-developer
Add fit and fit_processed_data methods to synthesizers - Issue #1019 by @pvk-developer
Add preprocess method to synthesizers - Issue #1018 by @pvk-developer
Add sampling to synthesizer classes - Issue #1015 by @pvk-developer
Add validate method to synthesizer - Issue #1014 by @fealho
Create GaussianCopula, CTGAN, TVAE and CopulaGAN synthesizer classes - Issue #1013 by @pvk-developer
Create BaseSynthesizer class - Issue #1012 by @pvk-developer
Add constraint conversion to upgrade_metadata - Issue #1005 by @amontanez24
Add method to generate keys to DataProcessor - Issue #994 by @pvk-developer
Create formatter - Issue #970 by @fealho
Create a utility to load multiple CSV files at once - Issue #969 by @amontanez24
Create a utility to convert old --> new metadata format - Issue #966 by @amontanez24
Add validation check that primary_key, alternate_keys and sequence_key cannot be sdtype categorical - Issue #963 by @fealho
Add anonymization to DataProcessor - Issue #950 by @pvk-developer
Add utility methods to DataProcessor - Issue #948 by @fealho
Add fit, transform and reverse_transform to DataProcessor - Issue #947 by @amontanez24
Create DataProcessor class - Issue #946 by @amontanez24
Add add_constraint method to MultiTableMetadata - Issue #895 by @amontanez24
Add key related methods to MultiTableMetadata - Issue #894 by @fealho
Add update_column and add_column methods to MultiTableMetadata - Issue #893 by @amontanez24
Add detect methods to MultiTableMetadata - Issue #892 by @amontanez24
Add load_from_json and save_to_json methods to the MultiTableMetadata - Issue #891 by @fealho
Add add_relationship method to MultiTableMetadata - Issue #890 by @pvk-developer
Add validate method to MultiTableMetadata - Issue #888 by @pvk-developer
Add visualize method to MultiTableMetadata class - Issue #884 by @amontanez24
Create MultiTableMetadata class - Issue #883 by @pvk-developer
Add add_constraint method to SingleTableMetadata - Issue #881 by @amontanez24
Add key related methods to SingleTableMetadata - Issue #880 by @fealho
Add validate method to SingleTableMetadata - Issue #879 by @fealho
Add _validate_inputs class method to each constraint - Issue #878 by @fealho
Add update_column and add_column methods to SingleTableMetadata - Issue #877 by @pvk-developer
Add detect methods to SingleTableMetadata - Issue #876 by @pvk-developer
Add load_from_json and save_to_json methods to SingleTableMetadata - Issue #874 by @pvk-developer
Create SingleTableMetadata class - Issue #873 by @pvk-developer

Bugs Fixed

In upgrade_metadata, PII values are being converted to generic categorical columns - Issue #1317 by @frances-h
PARSynthesizer is missing save and load methods - Issue #1289 by @amontanez24
Confusing warning when updating transformers - Issue #1272 by @frances-h
When adding constraints, auto_assign_transformers is showing columns that should no longer exist - Issue #1260 by @pvk-developer
Cannot fit twice if I modify transformers: ValueError: There are non-numerical values in your data. - Issue #1259 by @frances-h
Cannot fit twice if I add constraints: ValueError: There are non-numerical values in your data. - Issue #1258 by @frances-h
HMASynthesizer errors out when fitting a dataset that has a table which holds primary key and foreign keys only - Issue #1257 by @pvk-developer
Change ValueErrors to InvalidMetadataErrors - Issue #1251 by @frances-h
Multi-table should show foreign key transformers as None - Issue #1249 by @frances-h
Cannot use HMASynthesizer.fit_processed_data more than once (KeyError) - Issue #1240 by @frances-h
Function get_available_demos crashes if a dataset's num-tables or size-MB cannot be found - Issue #1215 by @amontanez24
Cannot supply a natural key to HMASynthesizer (where sdtype is custom): Error in sample - Issue #1214 by @pvk-developer
Unable to sample when using a PseudoAnonymizedFaker - Issue #1207 by @pvk-developer
Incorrect sdtype specified in demo dataset student_placements_pii - Issue #1206 by @amontanez24
Auto assigned transformers for datetime columns don't have the right parameters - Issue #1204 by @pvk-developer
Cannot apply Inequality constraint on demo dataset's datetime columns - Issue #1203 by @pvk-developer
pii should not be required to auto-assign faker transformers - Issue #1194 by @pvk-developer
Misc. bug fixes for SDV 1.0.0 - Issue #1193 by @pvk-developer
Small bug fixes in demo module - Issue #1192 by @pvk-developer
Foreign Keys are added as Alternate Keys when upgrading - Issue #1143 by @pvk-developer
Alternate keys not unique when assigned to a semantic type - Issue #1111 by @pvk-developer
Synthesizer errors if column is semantic type and pii is False - Issue #1110 by @fealho
Sampled values not unique if primary key is numerical - Issue #1109 by @pvk-developer
Validate not called during synthesizer creation - Issue #1105 by @pvk-developer
SingleTableSynthesizer fit doesn't update rounding - Issue #1104 by @amontanez24
Method auto_assign_transformers always sets enforce_min_max_values=True - Issue #1095 by @fealho
Sampled context columns in PAR must be in the same order - Issue #1052 by @amontanez24
Incorrect schema version printing during detect_table_from_dataframe - Issue #1038 by @amontanez24
Same relationship can be added twice to MultiTableMetadata - Issue #1031 by @amontanez24
Miscellaneous metadata bugs - Issue #1026 by @amontanez24

Maintenance

SDV Package Maintenance Updates - Issue #1140 by @amontanez24

Internal

Add integration tests for 'Synthesize Sequences' demo - Issue #1295 by @pvk-developer
Add integration tests for 'Adding Constraints' demo - Issue #1280 by @pvk-developer
Add integration tests to the 'Use Your Own Data' demo - Issue #1278 by @frances-h
Add integration tests for 'Synthesize Multi Tables' demo - Issue #1277 by @pvk-developer
Add integration tests for 'Synthesize a Table' demo - Issue #1276 by @frances-h
Update get_available_demos tests - Issue #1247 by @fealho
Make private attributes public in the metadata - Issue #1245 by @fealho
