
Synthegrate

Synthegrate is a tool for generating integration scenarios with synthetic data. It was created to evaluate mapping generation algorithms, but it can be used in any kind of integration test that needs multiple source tables (in common or separate schemas) that merge through union and/or join operations to populate a target schema.
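For example, a generated target table might be populated from the sources through a mapping of the following shape. This is only an illustrative sketch; all schema, table, and column names here are hypothetical:

    -- Hypothetical mapping: two source tables merge through a join,
    -- then union with a third source table to populate the target.
    INSERT INTO target_schema.employee (id, name, dept)
    SELECT p.id, p.name, d.dept_name
    FROM schema_0.person p
    JOIN schema_0.department d ON p.dept_id = d.id
    UNION
    SELECT s.id, s.name, s.dept
    FROM schema_1.staff s;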

Getting Started

To run Synthegrate, download the files in the runnable_jar folder, which contains the JAR file and the utility files needed to create the schemas and populate them with instances using Datafiller.

Prerequisites

  1. For generating the scenario files: Java 8

  2. [Optional] For materializing the data:

    • PostgreSQL + a database management tool (such as pgAdmin)
    • Python (for running datafiller.py)

Running

  • To create the synthetic scenario, run the JAR file:

    $ java -jar Synthegrate.jar -p synthegrate.properties

  • To materialize the synthetic scenario:

    1. Create the PostgreSQL database. Note: its name must match the one set in the properties file (the database.name parameter).
    2. In the parent folder of the generated scenario folder, run:

    $ ./create_db_schemas <database_name>

Note: Make sure that all the files in the runnable_jar/to_materialize folder are in the same folder as the scenario folder, e.g.:

  • working_directory
    • scenario_1_0_5_6_100_300_2_0_2_2 (synthetic scenario folder)
    • build_spicy_target_db
    • create_db_schemas
    • datafiller.py
    • dataFillerRunner

Warning: When a new schema file is parsed, any schemas with the same name in the database will be dropped and recreated. To avoid losing data, back up previously materialized scenarios whose schema names also appear in the scenario that is about to be materialized.
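For instance, a single schema can be backed up beforehand with standard PostgreSQL tooling (the schema and database names below are placeholders):

    $ pg_dump -n <schema_name> -f <schema_name>_backup.sql <database_name>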

Output files

The generated output files are all contained within a separate folder for each scenario:

  • 1 schema_schema_X.sql file for each source schema (there can be more than one, depending on the scenario.schemas_number parameter)
  • 1 schema_schema_X.sql file for the target schema (the number of target schemas is fixed at 1)
  • 1 datafiller_schema_X.sql file for each source schema (used by Datafiller to populate the instances)
  • 1 file for matches (which source attribute matches which target attribute; see the example entries after this list). The format of a match is:
    • Match(<source_schema_name>.<source_relation_name>.<source_attribute_name>,<target_schema_name>.<target_relation_name>.<target_attribute_name>,<confidence>).
  • 1 SQL file for the ground truth (expected) mapping. This file is not executable as it contains two parts:
    • the SQL script (which needs to be separated out before it can be run)
    • a short description of the mapping, which comes in handy when the mapping tasks become very complicated.
  • 1 file for linking the schema names to the corresponding database. An entry has the following format:
    • datasource(<name_used_in_other_files>,localhost:5432/<database_name>,JDBC_POSTGRES,null).
    • this is generated according to the database parameters set in the properties file.
  • 1 file for declaring which source schema maps to which target schema. The file contains one entry for each source schema that needs to be mapped:
    • map_schemas(<source_schema_name>,<target_schema_name>).
    • the names are the <name_used_in_other_files> declared in the datasource file.
  • 1 XML file containing a mapping task for ++Spicy (a mapping generation tool)
  • separate schema_X_spicy_source/target.sql files for creating the same tables in the public schema, since ++Spicy does not recognize other schemas.
  • 1 XML file for profiling data. This file contains information about (partial) inclusion dependencies and candidate keys.
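For illustration, entries in the matches, datasource, and source-target files could look as follows, instantiating the formats above (all names and the confidence value are hypothetical):

    Match(schema_0.person.name,target_schema_0.employee.full_name,1.0).
    datasource(schema_0,localhost:5432/synthegrate_db,JDBC_POSTGRES,null).
    map_schemas(schema_0,target_schema_0).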

synthegrate.properties file

Output files

  1. Set the output folder:

    • output.folder = <where to create the scenario folder>
  2. PROFILE DATA filename (not full path)

    • output.profile = <the name of the XML file that will contain the profile data information>
  3. MATCHES filename (not full path)

    • output.match = <the name of the matches file>
  4. DATASOURCE filename (not full path)

    • output.sources = <the name of the file that will contain information about where in the database the source can be found>
  5. SOURCE-TARGET filename

    • output.source_target = <the name of the file that will contain information about which source schema needs mapping to which target schema>
  6. EXPECTED MAPPING filename

    • output.expected_mapping = <the name of the SQL file that will contain the ground truth (expected) mapping>
  7. ++Spicy mapping task filename

    • output.spicy_maptask = <filename>

Input files

  1. The path to the folder where the dictionaries can be found (or created, if suitable ones are not found). The dictionaries are used to create the tuple instances with datafiller.py.
    • input.dictionaries_folder = <folder_path>
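Putting the output and input settings together, a fragment of synthegrate.properties could look like the following. All paths and filenames here are hypothetical placeholders:

    output.folder = /home/user/scenarios
    output.profile = profile_data.xml
    output.match = matches.txt
    output.sources = datasources.txt
    output.source_target = source_target.txt
    output.expected_mapping = expected_mapping.sql
    output.spicy_maptask = spicy_maptask.xml
    input.dictionaries_folder = /home/user/dictionaries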

Scenario parameters

  1. scenario.max_source_relations = <positive integer>

  2. scenario.min_cardinality = <positive integer>

  3. scenario.max_cardinality = <positive integer>

  4. scenario.min_arity = <positive integer>

  5. scenario.max_arity = <positive integer>

  6. scenario.tries = <positive integer> (default value = 1000)

  7. scenario.schemas_number = <positive integer>

  8. scenario.target_relations = <positive integer>

  9. If the target has more than one table, set how many tables have a declared primary key:

    • scenario.target_pks_number = <positive integer>
  10. Maximum number of join operations:

    • scenario.max_joins_number = <0 or positive integer>
  11. Maximum number of union operations:

    • scenario.max_unions_number = <0 or positive integer>
  12. Ratio of explicit foreign keys out of the total FK opportunities that are created. FK opportunities can be created just by declaring proper candidate keys and inclusion dependencies in the profile data file.

    • scenario.fks_ratio = <between 0.0 and 1.0>
  13. Ratio of disjoint unions out of the total unions created (i.e., whether the instances of the unionable relations are disjoint):

    • scenario.disjoint_unions = <between 0.0 and 1.0>
  14. Ratio for reusing an already created attribute to split a relation into two relations linked through an (explicit) FK:

    • scenario.reuse_join_attributes = <between 0.0 and 1.0>
  15. Whether the scenario should be created such that the union operations are expected to be performed at the end of the mapping (after several source relations have been merged through joins):

    • scenario.create_unions_first = <true if unions are at the end, false if the unions are expected to be performed first>
  16. The same as item 15, but for join operations:

    • scenario.create_joins_first = <true or false>
  17. When creating join opportunities, the join can be of several types:

    • scenario.star_join = <true or false>
    • scenario.chain_join = <true or false>
    • scenario.balanced_join = <true or false>
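A hypothetical scenario configuration, using illustrative values only (not recommendations), might look like this:

    scenario.max_source_relations = 6
    scenario.min_cardinality = 100
    scenario.max_cardinality = 300
    scenario.min_arity = 2
    scenario.max_arity = 5
    scenario.tries = 1000
    scenario.schemas_number = 2
    scenario.target_relations = 2
    scenario.target_pks_number = 2
    scenario.max_joins_number = 2
    scenario.max_unions_number = 2
    scenario.fks_ratio = 0.5
    scenario.disjoint_unions = 1.0
    scenario.reuse_join_attributes = 0.0
    scenario.create_unions_first = false
    scenario.create_joins_first = true
    scenario.star_join = true
    scenario.chain_join = false
    scenario.balanced_join = false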

Database parameters

  1. database.user = <username> (default value: postgres)
  2. database.password = <password for username> (default value: postgres)
  3. database.location = <ip> (default value: localhost)
  4. database.port = <port> (default value: 5432)
  5. database.name = <database name>
  6. database.jdbc_driver = JDBC_POSTGRES (currently, PostgreSQL is the only supported database, so other values will not work)
  7. database.jdbc_prefix = jdbc:postgresql://
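For example, with the default PostgreSQL setup the database section could read as follows (the database name is a placeholder to fill in):

    database.user = postgres
    database.password = postgres
    database.location = localhost
    database.port = 5432
    database.name = <database_name>
    database.jdbc_driver = JDBC_POSTGRES
    database.jdbc_prefix = jdbc:postgresql://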

License

This project is licensed under the Apache 2.0 License - see the LICENSE.md file for details.
