Benchmark/test with perturbed data #26

Open
hardbyte opened this issue May 30, 2017 · 2 comments
Comments

@hardbyte
Collaborator

hardbyte commented May 30, 2017

Yangfeng suggested looking at Febrl to generate data with perturbations.

Manual - http://users.cecs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/

Additional test sets:

Aha! Link: https://csiro.aha.io/features/ANONLINK-76

@hardbyte
Collaborator Author

Test Set which contains the following folders:

  • clean: datasets with duplicates, no modifications. Datasets are named clean_x_y_i, where x is the dataset size, y is the percentage of duplicates, and i is the dataset identifier (assuming two data sources).

  • dirty_l_m_n: datasets with corrupted duplicates, generated by Febrl with some post-processing to separate originals and duplicates (l: the maximum number of duplicates for each original record; m: the maximum number of modifications in a field; n: the maximum number of modifications in a record).

  • dirty_typo: datasets with corrupted duplicates. Only the 'Surname' field values are modified. The modification types are insertion, deletion and substitution, chosen with equal probability; the error positions are randomly selected.
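
For anyone wanting to reproduce something like the dirty_typo set locally, here is a minimal sketch of the kind of perturbation described above. It is not Febrl's code and not necessarily how the test set was built. The function name `perturb_surname`, the lowercase ASCII alphabet, and the choice of exactly one edit per surname are all assumptions made for illustration.

```python
import random
import string


def perturb_surname(surname, rng=None, alphabet=string.ascii_lowercase):
    """Return `surname` with one random single-character edit.

    Hypothetical helper: the edit type (insert, delete, substitute) is drawn
    with equal probability and the position is drawn uniformly at random.
    The alphabet and the one-edit-per-value limit are assumptions.
    """
    if rng is None:
        rng = random.Random()

    # An empty string can only be perturbed by insertion.
    op = rng.choice(["insert", "delete", "substitute"]) if surname else "insert"

    if op == "insert":
        # Insertion may happen before any character or at the end.
        pos = rng.randrange(len(surname) + 1)
        return surname[:pos] + rng.choice(alphabet) + surname[pos:]

    pos = rng.randrange(len(surname))
    if op == "delete":
        return surname[:pos] + surname[pos + 1:]

    # Substitute: pick a replacement that differs from the current character,
    # so the result is always a genuine modification.
    candidates = [c for c in alphabet if c != surname[pos]]
    return surname[:pos] + rng.choice(candidates) + surname[pos + 1:]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a benchmark run is repeatable
    for name in ["smith", "nguyen", "o'brien"]:
        print(name, "->", perturb_surname(name, rng))
```

Passing a seeded `random.Random` keeps the generated duplicates stable between benchmark runs, which makes matching accuracy comparable across code changes.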

@wilko77
Collaborator

wilko77 commented Sep 26, 2017
