Benchmark/test with perturbed data #26

hardbyte · 2017-05-30T01:42:16Z

Yangfeng suggested looking at Febrl to generate data with perturbations.

Additional test sets:

hardbyte · 2017-07-11T23:16:51Z

The test set contains the following folders:

clean: datasets with duplicates, no modification. Datasets are named clean_x_y_i, where x is the dataset size, y is the percentage of duplicates, and i is the dataset identifier (assuming two data sources).
dirty_l_m_n: datasets with corrupted duplicates, generated by Febrl with some post-processing to separate originals and duplicates (l: the maximum number of duplicates for each original record; m: the maximum number of modifications in a field; n: the maximum number of modifications in a record).
dirty_typo: datasets with corrupted duplicates. Only the 'Surname' field values are modified. The modification types are insertion, deletion and substitution, each with equal probability; the error positions are randomly selected.
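
For reference, the dirty_typo corruption described above could look roughly like the sketch below. This is not the script that produced the test set; the function name `corrupt_surname`, the lowercase ASCII alphabet, and the choice of exactly one edit per value are all assumptions made for illustration.

```python
import random
import string


def corrupt_surname(surname, rng):
    """Apply one random edit to a surname (hypothetical helper).

    The edit type is chosen uniformly from insertion, deletion and
    substitution, and the error position is chosen uniformly at random.
    """
    op = rng.choice(["insert", "delete", "substitute"])
    if not surname:
        op = "insert"  # an empty value can only be extended
    # Assumption: replacement characters come from lowercase ASCII letters.
    letter = rng.choice(string.ascii_lowercase)
    if op == "insert":
        pos = rng.randrange(len(surname) + 1)
        return surname[:pos] + letter + surname[pos:]
    pos = rng.randrange(len(surname))
    if op == "delete":
        return surname[:pos] + surname[pos + 1:]
    # Substitution: redraw until the letter differs from the original one.
    while letter == surname[pos].lower():
        letter = rng.choice(string.ascii_lowercase)
    return surname[:pos] + letter + surname[pos + 1:]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    for name in ["smith", "nguyen", "o'brien"]:
        print(name, "->", corrupt_surname(name, rng))
```

Passing in a seeded `random.Random` keeps the generated typos reproducible between benchmark runs.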

wilko77 · 2017-09-26T06:07:51Z
