The Python requirements are stored in requirements.txt. Initialize the Python environment with:
pip install -r requirements.txt
The notebook random_forest.ipynb generates the random forest models and caches the results.
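The exact hyperparameters and cache paths live in the notebook itself; the following is only a minimal sketch of the train-and-cache pattern, assuming scikit-learn and joblib (function name, settings, and path are hypothetical):

```python
# Hypothetical sketch of the train-and-cache pattern; the real configuration
# is defined in random_forest.ipynb.
from joblib import dump
from sklearn.ensemble import RandomForestClassifier

def train_and_cache(X_train, y_train, cache_path="cache/random_forest.joblib"):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    dump(model, cache_path)  # cache the fitted model for reuse downstream
    return model
```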
The notebook svm_rbf.ipynb generates support vector classifiers with a radial basis function kernel and caches the results. These are the SVM results published in the paper.
The notebook svm_linear.ipynb generates support vector classifiers with a linear kernel and caches the results.
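The two SVM notebooks fit the same classifier family and differ only in the kernel. As a rough illustration (the C and gamma values here are placeholders; the notebooks define the actual settings):

```python
from sklearn.svm import SVC

svm_rbf = SVC(kernel="rbf", C=1.0, gamma="scale")  # svm_rbf.ipynb (paper results)
svm_linear = SVC(kernel="linear", C=1.0)           # svm_linear.ipynb
```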
To train the entity matching transformer (EMT) models, first generate the formatted data with the script prepare_emt_data.py:
python prepare_emt_data.py
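The script's internals are not reproduced here. As a loudly hypothetical sketch of one way the ratio/fold splits implied by the training parameters below could be built (subsample non-matches to a 1:ratio class balance, then split into ten folds; all names and logic are assumptions, not the script's actual code):

```python
# Hypothetical sketch only; the actual logic lives in prepare_emt_data.py.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_folds(X, y, ratio, n_folds=10, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    # keep all matches, subsample non-matches to a 1:ratio balance
    neg = rng.choice(np.flatnonzero(y == 0), size=ratio * len(pos), replace=False)
    idx = np.concatenate([pos, neg])
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    return [(idx[tr], idx[te]) for tr, te in skf.split(X[idx], y[idx])]
```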
You can then train each EMT model with the following command. Run it from within the entity-matching-transformer directory, in a separate Python environment with Python 3.8.10 and the requirements in entity-matching-transformer/requirements.txt.
./train [data]-[ratio]-[fold_index]
The three parameters in the command are restricted to:
- data $\in \{\text{abt-buy}, \text{amazon-google}, \text{walmart-amazon}, \text{wdc\_xlarge\_computers}, \text{wdc\_xlarge\_shoes}, \text{wdc\_xlarge\_watches}\}$
- ratio $\in \{1, 2, 3, 4, 5\}$
- fold_index $\in \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$
For example, to train the first of the ten fold configurations (fold_index 0) at a 1:2 ratio on the abt-buy data:
./train abt-buy-2-0
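To sweep every configuration rather than launching runs one at a time, a loop such as the following works (a convenience sketch, not a script shipped with the repo; run it from the entity-matching-transformer directory with the EMT environment active):

```python
import itertools
import subprocess

DATASETS = ["abt-buy", "amazon-google", "walmart-amazon",
            "wdc_xlarge_computers", "wdc_xlarge_shoes", "wdc_xlarge_watches"]

# 6 datasets x 5 ratios x 10 folds = 300 training runs
for data, ratio, fold in itertools.product(DATASETS, range(1, 6), range(10)):
    subprocess.run(["./train", f"{data}-{ratio}-{fold}"], check=True)
```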
To render the results shown in the paper (Figure 2, Figure 3, Figure 4, and Table 2), see create_figures.ipynb. This notebook also contains the precision and recall results for the datasets not shown in Figure 3. Note: Figure 3 is incorrectly labeled as showing results for the amazon-google dataset; it actually shows results for the wdc_xlarge_watches dataset.
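The figure notebook reads the cached results. As a rough illustration of the metric computation behind the precision/recall figures (the cache layout and path here are hypothetical; create_figures.ipynb defines the real one):

```python
from joblib import load
from sklearn.metrics import precision_score, recall_score

def report(model_path, X_test, y_test):
    # model_path is hypothetical; see create_figures.ipynb for actual cache paths
    model = load(model_path)
    y_pred = model.predict(X_test)
    return precision_score(y_test, y_pred), recall_score(y_test, y_pred)
```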
To view the
Reported accuracy improves as the class ratio grows, contrary to