v1.0.0 - 2023-03-28

@amontanez24 released this 28 Mar 20:39

This major release introduces a new SDV API that streamlines synthetic data generation. To achieve this, it adds several large features.

Metadata

Some of the most notable additions are the new SingleTableMetadata and MultiTableMetadata classes. These classes enable a number of features that make it easier to synthesize your data correctly such as:

  • Automatic data detection - Calling metadata.detect_from_dataframe() or metadata.detect_from_csv() populates the metadata automatically with the values it infers from the data.
  • Easy updating - Once an instance of the metadata is created, values can be easily updated using a number of methods defined in the API. For more info, view the docs.
  • Metadata validation - Calling metadata.validate() will return a report of any invalid definitions in the metadata specification.
  • Upgrading - Users with the previous metadata format can easily update to the new specification using the upgrade_metadata() method.
  • Saving and loading - The metadata itself can easily be saved to a JSON file and loaded again later.

Class and Module Names

Another major change is the renaming of our core modeling classes and modules. The new names highlight the difference between the underlying machine learning models and the objects responsible for the end-to-end workflow of generating synthetic data. The main name changes are as follows:

  • tabular -> single_table
  • relational -> multi_table
  • timeseries -> sequential
  • BaseTabularModel -> BaseSingleTableSynthesizer
  • GaussianCopula -> GaussianCopulaSynthesizer
  • CTGAN -> CTGANSynthesizer
  • TVAE -> TVAESynthesizer
  • CopulaGAN -> CopulaGANSynthesizer
  • PAR -> PARSynthesizer
  • HMA1 -> HMASynthesizer

In SDV 1.0, synthesizers are classes that take in metadata and handle data preprocessing, model training and model sampling. This is similar to the previous BaseTabularModel in SDV <1.0.
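
When migrating existing code, the renames listed above can be captured in a small lookup table. The sketch below is a hypothetical helper, not part of SDV: it only rewrites the module and class names in an import line, and it assumes the pre-1.0 class was spelled CopulaGAN.

```python
# Module renames from the list above (all live under the sdv package).
MODULE_RENAMES = {
    'tabular': 'single_table',
    'relational': 'multi_table',
    'timeseries': 'sequential',
}

# Class renames from the list above.
CLASS_RENAMES = {
    'BaseTabularModel': 'BaseSingleTableSynthesizer',
    'GaussianCopula': 'GaussianCopulaSynthesizer',
    'CTGAN': 'CTGANSynthesizer',
    'TVAE': 'TVAESynthesizer',
    'CopulaGAN': 'CopulaGANSynthesizer',  # assumed pre-1.0 spelling
    'PAR': 'PARSynthesizer',
    'HMA1': 'HMASynthesizer',
}


def upgrade_import(module, name):
    """Return the SDV 1.0 import line for a pre-1.0 module/class pair.

    Hypothetical helper: names that are not in the tables pass through
    unchanged.
    """
    new_module = MODULE_RENAMES.get(module, module)
    new_name = CLASS_RENAMES.get(name, name)
    return f'from sdv.{new_module} import {new_name}'


print(upgrade_import('tabular', 'GaussianCopula'))
print(upgrade_import('relational', 'HMA1'))
```

Renaming the import is only the first step: the new synthesizers are initialized with a metadata object, so call sites need updating as well.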

Synthetic Data Workflow

Synthesizers in SDV 1.0 define a clear workflow for generating synthetic data.

  1. Synthesizers are initialized with a metadata class.
  2. They can then be used to transform the data and apply constraints using the synthesizer.preprocess() method. This step also validates that the data matches the provided metadata to avoid errors in fitting or sampling.
  3. The processed data can then be fed into the underlying machine learning model using synthesizer.fit_processed_data(). (Alternatively, data can be preprocessed and fit to the model using synthesizer.fit().)
  4. Data can then be sampled using synthesizer.sample().

Each synthesizer class also provides a series of methods to help users customize the transformations their data goes through. Read more about that here.

Notice that the preprocessing and model fitting steps can now be separated. This can be helpful if preprocessing is time-consuming or if the data has been processed externally.

Other Highly Requested Features

Another major addition is control over randomization. In SDV <1.0, users could set a seed to control the randomization for only some columns. In SDV 1.0, randomization is controlled for all columns. Every new call to sample generates new data, but the synthesizer's seed can be reset to its original state using synthesizer.reset_sampling(), enabling reproducibility.

SDV 1.0 also makes the transformers used for preprocessing and the underlying machine learning models more accessible and transparent.

  • Using the synthesizer.get_transformers() method, you can access the transformers used to preprocess each column and view their properties. This can be useful for debugging and accessing privacy information like mappings used to mask data.
  • Distribution parameters learned by copula models can be accessed using the synthesizer.get_learned_distributions() method.

PII handling is improved by the following features:

  • Primary keys can be set to natural sdtypes (e.g., SSN, email, name). Previously they could only be numerical or text.
  • The PseudoAnonymizedFaker can be used to provide consistent mapping to PII columns. As mentioned before, the mapping itself can be accessed by viewing the transformers for the column using synthesizer.get_transformers().
  • A bug causing PII columns to slow down modeling is patched.

Finally, the synthetic data can now be easily evaluated using the evaluate_quality() and run_diagnostic() methods. The data can be compared visually to the actual data using the get_column_plot() and get_column_pair_plot() methods. For more info on how to visualize or interpret the synthetic data evaluation, read the docs here.

Issues Resolved

New Features

Bugs Fixed

  • In upgrade_metadata, PII values are being converted to generic categorical columns - Issue #1317 by @frances-h
  • PARSynthesizer is missing save and load methods - Issue #1289 by @amontanez24
  • Confusing warning when updating transformers - Issue #1272 by @frances-h
  • When adding constraints, auto_assign_transformers is showing columns that should no longer exist - Issue #1260 by @pvk-developer
  • Cannot fit twice if I modify transformers: ValueError: There are non-numerical values in your data. - Issue #1259 by @frances-h
  • Cannot fit twice if I add constraints: ValueError: There are non-numerical values in your data. - Issue #1258 by @frances-h
  • HMASynthesizer errors out when fitting a dataset that has a table which holds primary key and foreign keys only - Issue #1257 by @pvk-developer
  • Change ValueErrors to InvalidMetadataErrors - Issue #1251 by @frances-h
  • Multi-table should show foreign key transformers as None - Issue #1249 by @frances-h
  • Cannot use HMASynthesizer.fit_processed_data more than once (KeyError) - Issue #1240 by @frances-h
  • Function get_available_demos crashes if a dataset's num-tables or size-MB cannot be found - Issue #1215 by @amontanez24
  • Cannot supply a natural key to HMASynthesizer (where sdtype is custom): Error in sample - Issue #1214 by @pvk-developer
  • Unable to sample when using a PseudoAnonymizedFaker - Issue #1207 by @pvk-developer
  • Incorrect sdtype specified in demo dataset student_placements_pii - Issue #1206 by @amontanez24
  • Auto assigned transformers for datetime columns don't have the right parameters - Issue #1204 by @pvk-developer
  • Cannot apply Inequality constraint on demo dataset's datetime columns - Issue #1203 by @pvk-developer
  • pii should not be required to auto-assign faker transformers - Issue #1194 by @pvk-developer
  • Misc. bug fixes for SDV 1.0.0 - Issue #1193 by @pvk-developer
  • Small bug fixes in demo module - Issue #1192 by @pvk-developer
  • Foreign Keys are added as Alternate Keys when upgrading - Issue #1143 by @pvk-developer
  • Alternate keys not unique when assigned to a semantic type - Issue #1111 by @pvk-developer
  • Synthesizer errors if column is semantic type and pii is False - Issue #1110 by @fealho
  • Sampled values not unique if primary key is numerical - Issue #1109 by @pvk-developer
  • Validate not called during synthesizer creation - Issue #1105 by @pvk-developer
  • SingleTableSynthesizer fit doesn't update rounding - Issue #1104 by @amontanez24
  • Method auto_assign_transformers always sets enforce_min_max_values=True - Issue #1095 by @fealho
  • Sampled context columns in PAR must be in the same order - Issue #1052 by @amontanez24
  • Incorrect schema version printing during detect_table_from_dataframe - Issue #1038 by @amontanez24
  • Same relationship can be added twice to MultiTableMetadata - Issue #1031 by @amontanez24
  • Miscellaneous metadata bugs - Issue #1026 by @amontanez24

Maintenance

Internal