
aws-samples/aws-glue-test-data-generator

AWS Glue Test Data Generator for S3 Data Lakes and DynamoDB

Test data generation plays a critical role in evaluating system performance, validating accuracy, bug identification, enhancing reliability, assessing scalability, ensuring regulatory compliance, training machine learning models, and supporting CI/CD processes. It enables the discovery of potential issues and ensures that systems operate as intended across diverse scenarios.

The AWS Glue Test Data Generator provides a configurable framework for Test Data Generation using AWS Glue Pyspark serverless Jobs. The required test data description is fully configurable through a YAML configuration file.

Code Repository on Github

The source code and deployment instructions are available through this link: GitHub Code Repository

Supported data types

The Test Data Generation Framework currently supports the following types:

  • Unique Key Generator

    This generator produces formatted unique values that can be used as a partition key. You can specify a prefix and the number of leading zeros if required.

  • Child Key Generator

    This generator produces a child key referencing the primary key. This is useful for generating multi-level hierarchical data. You can specify the number of levels and how many nodes to generate per level.

  • String Data Generator

    This generator produces string data using several mechanisms:

    • Random Strings: you can specify the number of characters and the type of generated characters (numeric, alphabetic, or alphanumeric). This can be used to generate random serial numbers, ordinal data, codes, identity numbers, etc.

    • Strings from a Dictionary: you can provide a dictionary of words from which the generator picks values at random. This can be used to generate categorical columns with a predefined set of values such as order status, product type, marital status, gender, etc.

    • Strings from a Pattern: you can provide a generic pattern for your string data. This can be used to generate fake emails, formatted phone numbers, comments, address-like data, etc.

  • Integer Data Generator

    This generator produces random integer data from a specified range.

  • Float/Double Data Generator

    This generator produces random float/double data from an expression. This can be used to generate float values such as salary, temperature, profit, statistical data, etc.

  • Internet Address Data Generator

    This generator produces random IP addresses. This can be used to generate IP address ranges for testing applications used for internet traffic monitoring or filtering.

  • Date Data Generator

    This generator produces random dates from a configurable date range.

  • Close Date Data Generator

    This generator produces random dates relative to a configurable start date column, within a configurable range. This can be used to generate dates that fall within a specific interval of another date, such as a support ticket close date, deceased date, expiration date, etc.

Solution Architecture

(Solution architecture diagram)

The Test Data Generator is based on the PySpark library and is invoked as a PySpark AWS Glue job. The generator is fully configured through a YAML-formatted file stored in the S3 artefacts bucket. Deployment to the AWS account is done using the AWS Cloud Development Kit (CDK):

  1. AWS CDK generates the CloudFormation template and deploys it in the hosting AWS account.

  2. CloudFormation creates:

    1. The artefacts S3 bucket, and uploads the TDG PySpark library and the YAML configuration file into it.

    2. The TDG PySpark Glue job.

    3. The service IAM role required by the TDG PySpark Glue job.

  3. The TDG PySpark Glue job is invoked to generate the test data.

Deployment

  1. Clone the GitHub repository in your local development environment

  2. Set the following environment variables:

AWS_ACCOUNT to the AWS account id where you intend to deploy the Test Data Generator

AWS_REGION to the AWS region id where you intend to deploy the Test Data Generator

  3. Use aws configure to configure the AWS CLI with the access key for the AWS account.
  4. If the account is not CDK bootstrapped, you need to run the following command:

cdk bootstrap

  5. Open a terminal in the workspace path and run the following CDK command to deploy the solution:

$<workspace-path>/AWSGluePysparkTDG> cdk deploy

Configuration

Configuration File

The Test Data Generator is configured through the YAML file TDG_configuration_file.yml found in the artefacts bucket at the following path:

s3://tdg-artefacts-<account-id>/tgd_glue_job/Config/TDG_configuration_file.yml
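
A configuration file built from the parameters described below might have the following overall shape (a minimal, illustrative skeleton; the column names and values are placeholders, and the exact nesting may differ slightly from the sample file shipped in the repository):

number_of_generated_records: 1000
attributes_list:
  - ColumnName: customer_id
    Generator: key_generator
    DataDescriptor:
      Prefix: CUST
      LeadingZeros: 8
target_list:
  - target: S3
    attributes:
      BucketArn: arn:aws:s3:::tdg-output-bucket/customers/
      mode: overwrite
      header: 'True'
      delimiter: ','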

Configuration Parameters

number_of_generated_records

Number of desired generated records

attributes_list

Descriptor of the generated record fields/columns. You can configure the following data types:

  • Unique Key Generator

ColumnName: Column name

Generator: key_generator

DataDescriptor:

Prefix: (optional) prefix to the key generated values

LeadingZeros: (optional) number of digits formatting the key values. Key values are prefixed by leading zeros to generate a fixed number of digits.
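
For example, an attributes_list entry for this generator might look like the following (the column name and values are illustrative, assuming the DataDescriptor keys are nested as shown):

- ColumnName: customer_id
  Generator: key_generator
  DataDescriptor:
    Prefix: CUST
    LeadingZeros: 8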

  • Child Key Generator

ColumnName: Column name

Generator: child_key_generator

DataDescriptor:

Prefix: prefix should match the parent key prefix

LeadingZeros: should match the parent key LeadingZeros

ChildCountPerSublevel: a list of the number of nodes per hierarchy sub-level. For example, the following list describes three levels of hierarchy where level 1 has 10 nodes, level 2 has 100 nodes, and level 3 has 1000 nodes:

   - 10
   - 100
   - 1000
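
A complete child key entry might therefore look like this (illustrative values; Prefix and LeadingZeros mirror the parent key entry):

- ColumnName: org_unit_id
  Generator: child_key_generator
  DataDescriptor:
    Prefix: CUST
    LeadingZeros: 8
    ChildCountPerSublevel:
      - 10
      - 100
      - 1000
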
  • String Data Generator

1. Strings from a Dictionary

ColumnName: Column name

Generator: string_generator

DataDescriptor:

Values: a list of string values.
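
For instance, a categorical order status column might be configured as follows (the column name and dictionary values are illustrative):

- ColumnName: order_status
  Generator: string_generator
  DataDescriptor:
    Values:
      - NEW
      - PROCESSING
      - SHIPPED
      - DELIVERED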

2. Strings from a Pattern

ColumnName: Column name

Generator: string_generator

DataDescriptor:

Pattern: a pattern of expressions separated by #. Available expressions:

  1. Constant strings: can be any constant string, such as: Contact Details, @, Title:, etc.
  2. Random Numbers: ^N; for example, to specify 8 digits: ^N8
  3. Random Alphabetic Strings: ^A; for example, to specify a random string of 10 characters: ^A10
  4. Random Alphanumeric Strings: ^X; for example, to specify a random alphanumeric string of 5 characters: ^X5

For example, the following pattern

Contact Details: Email: #^X8#__#^N2#@#^A4#.#^A3# Phone: #^N8

will result in the following sample values:

Contact Details: Email: dTJeG0vO__65@rAeF.Dsh Phone: 9643728

Contact Details: Email: H8bmzlVP__8@KlVQ.Swc Phone: 84716259

Contact Details: Email: FAoNEfDV__6@HAYI.Jkp Phone: 4651938
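
As a complete attributes_list entry, the pattern above could be configured as follows (the column name is illustrative; the pattern is quoted so the # separators are preserved as plain text):

- ColumnName: contact_details
  Generator: string_generator
  DataDescriptor:
    Pattern: 'Contact Details: Email: #^X8#__#^N2#@#^A4#.#^A3# Phone: #^N8'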

3. Random Strings

ColumnName: Column name

Generator: string_generator

DataDescriptor:

Random: 'True'

NumChar: length of generated alphanumeric strings
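
For example, a 12-character random serial number column (illustrative values):

- ColumnName: serial_number
  Generator: string_generator
  DataDescriptor:
    Random: 'True'
    NumChar: 12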

  • Integer Data Generator

ColumnName: Column name

Generator: integer_generator

DataDescriptor:

Range: lower value, upper value
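
For example, a quantity column drawn from 1 to 100 might be configured as below (illustrative; whether Range is written as a single "lower, upper" value or as a two-element list may differ in the sample file):

- ColumnName: quantity
  Generator: integer_generator
  DataDescriptor:
    Range: 1, 100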

  • Float/Double Data Generator

ColumnName: Column name

Generator: float_generator

DataDescriptor:

Expression: SQL expression, such as: rand(42) * 3000
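
For example, a salary column using that expression (the column name is illustrative):

- ColumnName: salary
  Generator: float_generator
  DataDescriptor:
    Expression: rand(42) * 3000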

  • Date Data Generator

ColumnName: Column name

Generator: date_generator

DataDescriptor:

StartDate: start date of the date range, in the format DD/MM/YYYY

EndDate: end date of the date range, in the format DD/MM/YYYY
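
For example, an open date column spanning four years (illustrative column name and range):

- ColumnName: open_date
  Generator: date_generator
  DataDescriptor:
    StartDate: 01/01/2020
    EndDate: 31/12/2023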

  • Close Date Data Generator

ColumnName: Column name

Generator: close_date_generator

DataDescriptor:

StartDateColumnName: column name of the generated open date

CloseDateRangeInDays: maximum span from the open date, in days
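
For example, a close date falling within 90 days of the generated open_date column (illustrative values):

- ColumnName: close_date
  Generator: close_date_generator
  DataDescriptor:
    StartDateColumnName: open_date
    CloseDateRangeInDays: 90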

  • Internet Address Data Generator

ColumnName: Column name

Generator: ip_address_generator

DataDescriptor:

IpRanges: a list of ranges for the four numeric parts of the IP address, each in the form lower value, upper value. For example:

 - 9,10
 - 1,254
 - 1,128
 - 2,20
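
A complete entry using these ranges might look like this (the column name is illustrative):

- ColumnName: client_ip
  Generator: ip_address_generator
  DataDescriptor:
    IpRanges:
      - 9,10
      - 1,254
      - 1,128
      - 2,20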

target_list

The list of targets for the generator. The generator performs automatic data type conversion for every specified target. Currently, the generator supports the following targets:

  • S3 Buckets

target: S3

attributes:

BucketArn: S3 bucket ARN, including the prefix

mode: S3 bucket write mode (overwrite, append)

header: include a header in the generated data (True, False)

delimiter: CSV file delimiter
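
For example, a target_list entry writing CSV files to an S3 prefix might look like this (the bucket ARN and values are illustrative):

- target: S3
  attributes:
    BucketArn: arn:aws:s3:::tdg-output-<account-id>/customers/
    mode: overwrite
    header: 'True'
    delimiter: ','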

  • DynamoDB tables

target: Dynamodb

attributes:

dynamodb.output.tableName: DynamoDB table name

dynamodb.throughput.write.percent: throughput write percent
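
For example (the table name and write percentage are illustrative):

- target: Dynamodb
  attributes:
    dynamodb.output.tableName: TestDataTable
    dynamodb.throughput.write.percent: '0.5'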

Invocation

From the AWS Glue Console:

  1. Navigate to Data Integration and ETL > AWS Glue Studio > Jobs
  2. Select the TestDataGeneratorJob job and press Run Job
  3. Once the job completes successfully, check for the generated data in the configured targets.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Contributors