Decoy - A Synthetic Data Generator

Synthetic data generation with DuckDB

Decoy is a Synthetic Data Generator using DuckDB at it's core. The main aim of the project is to reduce the overhead in creating synthetic data and provide all the tools to easily test your synthetic data queries.

It has integrated well known Python based synthetic data generators such as Faker, Mimesis and Random.

The power of this application comes from the ability to write scripts to repeatably generate data, to use interactive tools to test your queries and the ability to create tables which interact, allowing for the creation of full relational databases.

Governance and Data Security

This tool generates data randomly, it does not require access to the real data and it does not learn from real data. This is a deliberate design decision, to ensure that the data generated by this tool can be used safely in non-production environments.

The only prior information it requires to make analoguous synthetic data is the names of the tables in the database, the column names and any relationships between the tables.

If you require synthetic data which mimics real data distributions to the point where you can do analytics or machine learning, this is not the tool you want!

Requirements & Setup

#Requires Python 3.10+
pip install --editable .

OR

#Requires Python 3.10+
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt

Data Generation Tools

Decoy exposes multiple command line tools to generate data and test generating data

Generate Data From a .sql File

The main core of decoy is the ability to repeatibly generate synthetic data from a .sql file.

decoy exec examples/basic.sql

There are multiple examples in the ./examples folder.

Interactive Testing Playgrounds

In addtion to a CLI to run .sql files. Decoy also provides an interactive environment to easily build your queries and quickly test that they are working correctly.

Jupyter Lab Kernel

Jupyter Lab is a web-based interactive repl for which we have created a decoy kernel allowing you to run decoy SQL queries and quickly observe their output

#install the kernel for first-time use
decoy kernel-install

#Running the Jupyter lab web server
jupyter lab

Then select decoy as the kernel when starting a new notebook instance.

Some of the files in ./examples are in the *.ipynb, these are to be read using the Jupyter kernel. Jupyter also integrates markdown too - so if required for documentation, you can also explain though processes.

Use the SQL REPL

Decoy also includes a command-line REPL too which provides the same interactive testing environment but on the command line.

decoy repl

Please not that the query must end with a ; and to execute the query select <Alt/Opt><Enter> as multiline queires are supported.

Parse A SQL Schema Definition

./cli sqlparse './examples/test_schema.sql' './examples/schema_parse_output.sql' 5

./cli exec ./examples/schema_parse_output.sql

** At the moment the kernel only works if the notebook is in the root of this project. Something needs fixing with the kernel path.

Data generation functions

The python Faker library is exposed for the en-GB locale, the registration function can easily be modified to include more locales.

select faker_name() from range(20);

The python Mimesis library is exposed for the en-GB locale, the registration function can easily be modified to include more locales.

select mimesis_person_full_name() from range(20);

There is a reverse regex function for generating text patterns.

SELECT xeger('\(\+0[0-9]{2}\) 07[0-9]{3} [0-9]{6}') as tel FROM range(20);

Functions are also available for intra-table sampling, column shuffling, oversampling from other tables.

-- Create a names table
CREATE TABLE name_table AS SELECT range as id, faker_name() as name FROM range(1000);

-- Select 10k names randomly from the name table.
SELECT oversample('name_table', 'name') from range(10000);

SELECT name, shuffle(name) as name_shuffled, intratable_sample(name) as name_sampled FROM name_table;

You can also generate imperfect data by using some of the data messer functions which will either add NULL or junk data. These are currently hardcoded and the functions are in ./decoy/udf_arrow.py

CREATE OR REPLACE TABLE names_full AS (
    SELECT
        faker_name() as full_name,
    FROM range(30)
);

UPDATE names_full SET full_name = messy_data_nullifier(full_name);

Exporting Data

DuckDB handles multiple export functions. Including CSV and Parquet

COPY (SELECT faker_name() FROM range(5)) TO 'export_test.csv' WITH (HEADER 1, DELIMITER ',');

COPY (SELECT faker_name() FROM range(5)) TO 'export_parquet.parquet';

Unit tests

PYTHONPATH=. pytest

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
.github/workflows		.github/workflows
decoy		decoy
examples		examples
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
notebook.ipynb		notebook.ipynb
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py

License

EqualExperts/decoy

Folders and files

Latest commit

History

Repository files navigation

Decoy - A Synthetic Data Generator

Governance and Data Security

Requirements & Setup

Data Generation Tools

Generate Data From a .sql File

Interactive Testing Playgrounds

Jupyter Lab Kernel

Use the SQL REPL

Parse A SQL Schema Definition

Data generation functions

Exporting Data

Unit tests

About

Resources

License

Stars

Watchers

Forks

Languages