Releases: sdv-dev/SDV

v0.16.0 - 2022-07-21

22 Jul 03:02

This release brings user-friendly improvements and bug fixes to the SDV constraints, to help
users generate their synthetic data easily.

Some predefined constraints have been renamed and redefined to be more user-friendly and consistent.
The custom constraint API has also been updated for usability. The SDV now automatically determines
the best handling_strategy to use for each constraint, attempting transform by default and
falling back to reject_sampling otherwise. The handling_strategy parameters are no longer
included in the API.
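
The selection rule described above (attempt transform, otherwise fall back to reject_sampling) can be pictured with a small, library-independent sketch. Everything here is hypothetical and is not SDV code: the classes `LogPositive` and `NoTransform` and the function `choose_strategy` are illustrative names, and treating "any exception from the transform" as the fallback trigger is an assumption.

```python
import math


class LogPositive:
    """Hypothetical constraint: values must be > 0, enforced via a log transform."""

    def transform(self, values):
        # math.log raises ValueError for values <= 0, so this transform can fail.
        return [math.log(v) for v in values]


class NoTransform:
    """Hypothetical constraint that has no usable transform."""

    def transform(self, values):
        raise NotImplementedError("this constraint has no transform")


def choose_strategy(constraint, values):
    """Return 'transform' if the transform succeeds, else 'reject_sampling'."""
    try:
        constraint.transform(values)
    except Exception:
        # Assumption: any failure in the transform triggers the fallback.
        return "reject_sampling"
    return "transform"
```

In this sketch the caller never names a strategy, which mirrors the removal of the handling_strategy parameters from the API.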

Finally, this version of SDV also unifies the parameters for all sampling-related methods across
all models (including TabularPreset).

Changes to Constraints

  • The GreaterThan constraint is now split into two new constraints: Inequality, which is
    intended to be used between two columns, and ScalarInequality, which is intended to be used
    between a column and a scalar.

  • The Between constraint is now split into two new constraints: Range, which is intended to
    be used between three columns, and ScalarRange, which is intended to be used between a column
    and low and high scalar values.

  • FixedIncrements is a new constraint that makes the data increment by a certain value.

  • A new create_custom_constraint function is available to create custom constraints.
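
To make the split between column-based and scalar-based constraints concrete, here is a minimal sketch of the row-level rule each new constraint expresses. It deliberately does not call SDV: the function names and argument names are hypothetical, the choice of strict versus non-strict comparisons is an assumption, and reading FixedIncrements as "every value is a multiple of the increment" is an interpretation of the description above.

```python
import operator

# Comparison operators a scalar inequality might accept (assumed set).
_RELATIONS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def inequality(row, low_column, high_column):
    """Two columns: the low column must not exceed the high column."""
    return row[low_column] <= row[high_column]


def scalar_inequality(row, column, relation, value):
    """One column compared against a fixed scalar."""
    return _RELATIONS[relation](row[column], value)


def range_check(row, low_column, middle_column, high_column):
    """Three columns: the middle column lies between the other two."""
    return row[low_column] < row[middle_column] < row[high_column]


def scalar_range(row, column, low_value, high_value):
    """One column bounded by low and high scalar values."""
    return low_value < row[column] < high_value


def fixed_increments(row, column, increment):
    """One column whose values are multiples of the increment (integers assumed)."""
    return row[column] % increment == 0
```

A quick usage example: for `row = {"start": 1, "mid": 5, "end": 10, "age": 30, "price": 500}`, the call `range_check(row, "start", "mid", "end")` involves three columns, while `scalar_range(row, "age", 18, 65)` involves one column and two scalars.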

Removed Constraints

  • Rounding: rounding is now handled automatically by the rdt.HyperTransformer.
  • ColumnFormula: create_custom_constraint replaces this constraint and allows more
    advanced usage for end users.

New Features

  • Improve error message for invalid constraints - Issue #801 by @fealho
  • Numerical Instability in Constrained GaussianCopula - Issue #806 by @fealho
  • Unify sampling params for reject sampling - Issue #809 by @amontanez24
  • Split GreaterThan constraint into Inequality and ScalarInequality - Issue #814 by @fealho
  • Split Between constraint into Range and ScalarRange - Issue #815 by @pvk-developer
  • Change columns to column_names in OneHotEncoding and Unique constraints - Issue #816 by @amontanez24
  • Update columns parameter in Positive and Negative constraint - Issue #817 by @fealho
  • Create FixedIncrements constraint - Issue #818 by @amontanez24
  • Improve datetime handling in ScalarInequality and ScalarRange constraints - Issue #819 by @pvk-developer
  • Support strict boundaries even when transform strategy is used - Issue #820 by @fealho
  • Add create_custom_constraint factory method - Issue #836 by @fealho

Bugs Fixed

  • Numerical Instability in Constrained GaussianCopula - Issue #801 by @tlranda and @fealho
  • Fix error message for FixedIncrements - Issue #865 by @pvk-developer
  • Fix constraints with conditional sampling - Issue #866 by @amontanez24
  • Fix error message in ScalarInequality - Issue #868 by @pvk-developer
  • Cannot use max_tries_per_batch on sample: TypeError: sample() got an unexpected keyword argument 'max_tries_per_batch' - Issue #885 by @amontanez24
  • Conditional sampling + batch size: ValueError: Length of values (1) does not match length of index (5) - Issue #886 by @amontanez24
  • TabularPreset doesn't support new sampling parameters - Issue #887 by @fealho
  • Conditional Sampling: batch_size is being set to None by default? - Issue #889 by @amontanez24
  • Conditional sampling using GaussianCopula inefficient when categories are noised - Issue #910 by @amontanez24

v0.15.0 - 2022-05-25

25 May 20:52

This release improves the speed of the GaussianCopula model by removing logic that previously searched for the appropriate distribution to use. It also fixes a bug that occurred when conditional sampling was used with the TabularPreset.

The rest of the release focuses on improving constraints: the UniqueCombinations constraint is changed to FixedCombinations, the Unique constraint now works with missing values, and the OneHotEncoding constraint now raises an error when it sees null values.
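
As an illustration of the OneHotEncoding behavior described above, the following sketch validates one-hot rows and raises as soon as it meets a null. It is not SDV's implementation: `one_hot_is_valid` and `_is_null` are hypothetical helpers, and the choice of `ValueError` and the exact message are assumptions.

```python
import math


def _is_null(value):
    """Treat None and float NaN as null values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def one_hot_is_valid(rows, column_names):
    """Return one boolean per row; raise if any one-hot column holds a null."""
    results = []
    for row in rows:
        values = [row[name] for name in column_names]
        if any(_is_null(v) for v in values):
            # Assumption: nulls are rejected outright rather than marked invalid.
            raise ValueError("null values are not allowed in one-hot encoded columns")
        # A valid row has only 0s and 1s, with exactly one 1.
        results.append(all(v in (0, 1) for v in values) and sum(values) == 1)
    return results
```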

New Features

  • Silence warnings coming from univariate fit in copulas - Issue #769 by @pvk-developer
  • Remove parameters related to distribution search and change default - Issue #767 by @fealho
  • Update the UniqueCombinations constraint - Issue #793 by @fealho
  • Make Unique constraint works with nans - Issue #797 by @fealho
  • Error out if nans in OneHotEncoding - Issue #800 by @amontanez24

Bugs Fixed

  • Unable to sample conditionally in Tabular_Preset model - Issue #796 by @katxiao

Documentation Changes

  • Support GPU computing and progress track? - Issue #478 by @fealho

v0.14.1 - 2022-05-03

03 May 16:21

This release adds a TabularPreset, available in the sdv.lite module, which allows users to easily optimize a tabular model for speed.
In this release, we also include bug fixes for sampling with conditions, an unresolved warning, and setting field distributions. Finally,
we include documentation updates for sampling and the new TabularPreset.

Bugs Fixed

  • Sampling with conditions={column: 0.0} for float columns doesn't work - Issue #525 by @shlomihod and @tssbas
  • resolved FutureWarning with Pandas replaced append by concat - Issue #759 by @Deathn0t
  • Field distributions bug in CopulaGAN - Issue #747 by @katxiao
  • Field distributions bug in GaussianCopula - Issue #746 by @katxiao

New Features

  • Set default transformer to categorical_fuzzy - Issue #768 by @amontanez24
  • Model nulls normally when tabular preset has constraints - Issue #764 by @katxiao
  • Don't modify my metadata object - Issue #754 by @amontanez24
  • Presets should be able to handle constraints - Issue #753 by @katxiao
  • Change preset optimize_for --> name - Issue #749 by @katxiao
  • Create a speed optimized Preset - Issue #716 by @katxiao

v0.14.0 - 2022-03-21

21 Mar 15:38

This release updates the sampling API and splits the existing functionality into three methods - sample, sample_conditions,
and sample_remaining_columns. We also add support for sampling in batches, displaying a progress bar when sampling with more than one batch,
sampling deterministically, and writing the sampled results to an output file. Finally, we include fixes for sampling with conditions
and updates to the documentation.
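
The combination of batch sampling, deterministic sampling, and writing results to an output file can be sketched without any SDV dependency. This is a toy stand-in, not the library's sampling code: `sample_in_batches`, its parameters, and the two-column row layout are hypothetical, and a seeded `random.Random` stands in for a fitted model.

```python
import csv
import io
import random


def sample_in_batches(num_rows, batch_size, seed, output=None):
    """Generate num_rows toy rows in batches, optionally streaming them as CSV."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    writer = None
    if output is not None:
        writer = csv.DictWriter(output, fieldnames=["id", "value"])
        writer.writeheader()

    rows = []
    while len(rows) < num_rows:
        # The last batch may be smaller than batch_size.
        size = min(batch_size, num_rows - len(rows))
        start = len(rows)
        batch = [{"id": start + i, "value": rng.random()} for i in range(size)]
        if writer is not None:
            writer.writerows(batch)  # write each batch as soon as it is ready
        rows.extend(batch)
    return rows


# Usage: two runs with the same seed produce identical rows.
buffer = io.StringIO()
first = sample_in_batches(5, batch_size=2, seed=0, output=buffer)
second = sample_in_batches(5, batch_size=2, seed=0)
```

Writing batch by batch means partial results reach the file even if a later batch fails, which is one plausible reason to pair batching with file output.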

Bugs Fixed

  • Fix write to file in sampling - Issue #732 by @katxiao
  • Conditional sampling doesn't work if the model has a CustomConstraint - Issue #696 by @katxiao

New Features

  • Updates to GaussianCopula conditional sampling methods - Issue #729 by @katxiao
  • Update conditional sampling errors - Issue #730 by @katxiao
  • Enable Batch Sampling + Progress Bar - Issue #693 by @katxiao
  • Create sample_remaining_columns() method - Issue #692 by @katxiao
  • Create sample_conditions() method - Issue #691 by @katxiao
  • Improve sample() method - Issue #690 by @katxiao
  • Create Condition object - Issue #689 by @katxiao
  • Is it possible to generate data with new set of primary keys? - Issue #686 by @katxiao
  • No way to fix the random seed? - Issue #157 by @katxiao
  • Can you set a random state for the sdv.tabular.ctgan.CTGAN.sample method? - Issue #515 by @katxiao
  • generating different synthetic data while training the model multiple times. - Issue #299 by @katxiao

Documentation Changes

  • Typo in the document documentation - Issue #680 by @katxiao

v0.13.1 - 2021-12-22

22 Dec 20:35

This release adds support for passing tabular constraints to the HMA1 model, and adds more explicit error handling for
metric evaluation. It also includes a fix for using categorical columns in the PAR model and documentation updates
for metadata and HMA1.

Bugs Fixed

  • Categorical column after sequence_index column - Issue #314 by @fealho

New Features

  • Support passing tabular constraints to the HMA1 model - Issue #296 by @katxiao
  • Metric evaluation error handling metrics - Issue #638 by @katxiao

Documentation Changes

  • Make true/false values lowercase in Metadata Schema specification - Issue #664 by @katxiao
  • Update docstrings for hma1 methods - Issue #642 by @katxiao

v0.13.0 - 2021-11-22

22 Nov 21:06

This release makes multiple improvements to different Constraint classes. The Unique constraint can now
handle columns with the name index and no longer crashes on subsets of the original data. The Between
constraint can now handle columns with nulls properly. The memory usage of all constraints was also improved.

Various other features and fixes were added. Conditional sampling no longer crashes when the num_rows argument
is not provided. Multiple localizations can now be used for PII fields. Scaffolding for integration tests was added
and the workflows now run pip check.

Additionally, this release adds support for Python 3.9!

Bugs Fixed

  • Gaussian Copula – Memory Issue in Release 0.10.0 - Issue #459 by @xamm
  • Applying Unique Constraint errors when calling model.fit() on a subset of data - Issue #610 by @xamm
  • Calling sampling with conditions and without num_rows crashes - Issue #614 by @xamm
  • Metadata.visualize with path parameter throws AttributeError - Issue #634 by @xamm
  • The Unique constraint crashes when the data contains a column called index - Issue #616 by @xamm
  • The Unique constraint cannot handle non-default index - Issue #617 by @xamm
  • ConstraintsNotMetError when applying Between constraint on datetime columns containing null values - Issue #632 by @katxiao

New Features

  • Adds Multi localisations feature for PII fields defined in #308 - PR #609 by @xamm

Documentation Changes

  • Anonymizing PII in single table tutorials states address field as e-mail type - Issue #604 by @xamm

Special thanks to @xamm, @katxiao, @pvk-developer and @amontanez24 for all the work that made this release possible!

v0.12.1 - 2021-10-12

12 Oct 19:44

This release fixes bugs in constraints, metadata behavior, and SDV documentation. Specifically, we added
proper handling of data containing null values for constraints and timeseries data, and updated the
default metadata detection behavior.

Bugs Fixed

  • ValueError: The parameter loc has invalid values - Issue #353 by @fealho
  • Gaussian Copula is generating different data with metadata and without metadata - Issue #576 by @katxiao
  • Make pomegranate an optional dependency - Issue #567 by @katxiao
  • Small wording change for Question Issue Template - Issue #571 by @katxiao
  • ConstraintsNotMetError when using GreaterThan constraint with datetime - Issue #590 by @katxiao
  • GreaterThan constraint crashing with NaN values - Issue #592 by @katxiao
  • Null values in GreaterThan constraint raises error - Issue #589 by @katxiao
  • ColumnFormula raises ConstraintsNotMetError when checking NaN values - Issue #593 by @katxiao
  • GreaterThan constraint raises TypeError when using datetime - Issue #596 by @katxiao
  • Fix repository language - Issue #464 by @fealho
  • Update init.py - Issue #578 by @dyuliu
  • IndexingError: Unalignable boolean - Issue #446 by @fealho

v0.12.0 - 2021-08-17

19 Aug 05:29

This release focuses on improving and expanding upon the existing constraints. More specifically, users can now
(1) specify multiple columns in Positive and Negative constraints, (2) use the new Unique constraint and
(3) use datetime data with the Between constraint. Additionally, error messages have been added and updated
to provide more useful feedback to the user.

Besides the added features, several bugs regarding the UniqueCombinations and ColumnFormula constraints have been fixed,
and an error in the metadata.json for the student_placements dataset was corrected. The release also added documentation
for the fit_columns_model parameter, which affects the majority of the available constraints.

New Features

  • Change default fit_columns_model to False - Issue #550 by @katxiao
  • Support multi-column specification for positive and negative constraint - Issue #545 by @sarahmish
  • Raise error when multiple constraints can't be enforced - Issue #541 by @amontanez24
  • Create Unique Constraint - Issue #532 by @amontanez24
  • Passing invalid conditions when using constraints produces unreadable errors - Issue #511 by @katxiao
  • Improve error message for ColumnFormula constraint when constraint column used in formula - Issue #508 by @katxiao
  • Add datetime functionality to Between constraint - Issue #504 by @katxiao

Bugs Fixed

  • UniqueCombinations constraint with handling_strategy = 'transform' yields synthetic data with nan values - Issue #521 by @katxiao and @csala
  • UniqueCombinations constraint outputting wrong data type - Issue #510 by @katxiao and @csala
  • UniqueCombinations constraint on only one column gets stuck in an infinite loop - Issue #509 by @katxiao
  • Conditioning on a non-constraint column using the ColumnFormula constraint - Issue #507 by @katxiao
  • Conditioning on the constraint column of the ColumnFormula constraint - Issue #506 by @katxiao
  • Update metadata.json for duration of student_placements dataset - Issue #503 by @amontanez24
  • Unit test for HMA1 when working with a single child row per parent row - Issue #497 by @pvk-developer
  • UniqueCombinations constraint for more than 2 columns - Issue #494 by @katxiao and @csala

Documentation Changes

  • Add explanation of fit_columns_model to API docs - Issue #517 by @katxiao

v0.11.0 - 2021-07-12

12 Jul 22:44

This release primarily addresses bugs and feature requests related to using constraints for the single-table models. Users can now enforce scalar comparison with the existing GreaterThan constraint and apply 5 new constraints: OneHotEncoding, Positive, Negative, Between and Rounding. Additionally, the SDV will now auto-apply constraints for rounding numerical values, and for keeping the data within the observed bounds. All related user guides are updated with the new functionality.
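
The automatic rounding and bounds handling mentioned above can be pictured as "learn a profile from the real column, then apply it to sampled values". The sketch below is an assumption about the general idea, not SDV's code: `learn_numeric_profile` and `apply_profile` are hypothetical names, and inferring the number of decimals from each value's string form is just one simple way to do it.

```python
from decimal import Decimal


def learn_numeric_profile(values):
    """Record the most decimal places seen and the observed min and max."""
    decimals = max(
        max(-Decimal(str(v)).as_tuple().exponent, 0)  # digits after the point
        for v in values
    )
    return {"decimals": decimals, "min": min(values), "max": max(values)}


def apply_profile(values, profile):
    """Clip values to the observed bounds, then round to the learned precision."""
    result = []
    for value in values:
        clipped = min(max(value, profile["min"]), profile["max"])
        result.append(round(clipped, profile["decimals"]))
    return result


# Usage: the real data has at most 2 decimals and lies in [1.25, 10.0].
profile = learn_numeric_profile([1.25, 3.5, 10.0])
synthetic = apply_profile([0.123456, 7.77777, 12.0], profile)
```

Here the out-of-range values are pulled back to the nearest observed bound and the in-range value is rounded to two decimals.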

New Features

  • Add OneHotEncoding Constraint - Issue #303 by @fealho
  • GreaterThan Constraint should apply to scalars - Issue #410 by @amontanez24
  • Improve GreaterThan constraint - Issue #368 by @amontanez24
  • Add Non-negative and Positive constraints across multiple columns - Issue #409 by @amontanez24
  • Add Between values constraint - Issue #367 by @fealho
  • Ensure values fall within the specified range - Issue #423 by @amontanez24
  • Add Rounding constraint - Issue #482 by @katxiao
  • Add rounding and min/max arguments that are passed down to the NumericalTransformer - Issue #491 by @amontanez24

Bugs Fixed

  • GreaterThan constraint between Date columns rasises TypeError - Issue #421 by @amontanez24
  • GreaterThan constraint's transform strategy fails on columns that are not float - Issue #448 by @amontanez24
  • AttributeError on UniqueCombinations constraint with non-strings - Issue #196 by @katxiao
  • Use reject sampling to sample missing columns for constraints - Issue #435 by @amontanez24

Documentation Changes

  • Ensure privacy metrics are available in the API docs - Issue #458 by @fealho
  • Ensure formula constraint is called ColumnFormula everywhere in the docs - Issue #449 by @fealho

v0.10.1 - 2021-06-10

11 Jun 01:48

This release changes the way we sample conditions to not only group by the conditions passed by the user, but also by the transformed conditions that result from them.

Issues resolved

  • Conditionally sampling on variable in constraint should have variety for other variables - Issue #440 by @amontanez24