PAKDD-Class-Ratio

Initialize the Python Environment

The python requirements are stored in requirements.txt. Initialize the python environment with:

pip install -r requirements.txt

Train the Random Forest Classifiers

The notebook random_forest.ipynb generates the random forest models and caches the results.

Train the Support Vector Classifiers

The notebook svm_rbf.ipynb generates support vector classifiers with a radial basis function kernel and caches the results. These are the SVM results published in the paper.

The notebook svm_linear.ipynb generates support vector classifiers with a linear kernel and caches the results.

Train the Entity Matching Transformers

To train the entity matching transformer (EMT) models, first generate the formatted data with the script prepare_emt_data.py:

python prepare_emt_data.py

You can then train each EMT model with the following command. This should be run within the entity-matching-transformer directory, in a separate python environment with python 3.8.10 and the requirements in entity-matching-transformer/requirements.txt.

./train [data]-[ratio]-[fold_index]

The three parameters in the command are restricted to:

data $\in \{\text{abt-buy}, \text{amazon-google}, \text{walmart-amazon}, \text{wdc\_xlarge\_computers}, \text{wdc\_xlarge\_shoes}, \text{wdc\_xlarge\_watches}\}$
ratio $\in \{1,2,3,4,5\}$
fold_index $\in \{0,1,2,3,4,5,6,7,8,9\}$

For example, to train the first of the 10 fold configurations on a 1:2 ratio for the abt-buy data:

./train abt-buy-2-0

Display the Results

To render the results shown in the paper (Figure 2, Figure 3, Figure 4, and Table 2), see create_figures.ipynb. This also contains the precision and recall results for the datasets not shown in Figure 3. Note: Figure 3 incorrectly is labeled as the results for the amazon-google dataset, but actually shows the results for the wdc_xlarge_watches dataset.

To view the $F_1$ measure results for when the classification threshold is fixed to 0.5, see fixed_threshold_results.ipynb. This notebook also contains results for the Matthews correlation coefficient (MCC) and accuracy. We only compute these two additional measures for the fixed classification threshold of 0.5.

Reported accuracy improves as the class ratio grows, contrary to $F_1$ measure and MCC. This is because accuracy is computed on both classes (not just the matches) and the increase in the number of correctly identified non-matches is greater than the decrease in the number of correctly identified matches.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

entity-matching-transformer

entity-matching-transformer

.gitignore

.gitignore

README.md

README.md

additional_emt_results.ipynb

additional_emt_results.ipynb

cache.py

cache.py

create_figures.ipynb

create_figures.ipynb

fixed_threshold_results.ipynb

fixed_threshold_results.ipynb

prepare_emt_data.py

prepare_emt_data.py

random_forest.ipynb

random_forest.ipynb

requirements.txt

requirements.txt

svm_linear.ipynb

svm_linear.ipynb

svm_rbf.ipynb

svm_rbf.ipynb

Repository files navigation

PAKDD-Class-Ratio

Initialize the Python Environment

Train the Random Forest Classifiers

Train the Support Vector Classifiers

Train the Entity Matching Transformers

Display the Results

About

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
data		data
entity-matching-transformer		entity-matching-transformer
.gitignore		.gitignore
README.md		README.md
additional_emt_results.ipynb		additional_emt_results.ipynb
cache.py		cache.py
create_figures.ipynb		create_figures.ipynb
fixed_threshold_results.ipynb		fixed_threshold_results.ipynb
prepare_emt_data.py		prepare_emt_data.py
random_forest.ipynb		random_forest.ipynb
requirements.txt		requirements.txt
svm_linear.ipynb		svm_linear.ipynb
svm_rbf.ipynb		svm_rbf.ipynb

foxcroftjn/PAKDD-Class-Ratio

Folders and files

Latest commit

History

Repository files navigation

PAKDD-Class-Ratio

Initialize the Python Environment

Train the Random Forest Classifiers

Train the Support Vector Classifiers

Train the Entity Matching Transformers

Display the Results

About

Topics

Resources

Stars

Watchers

Forks

Languages