product-matching

The code of Team Rhinobird for Task One of the Mining the Web of HTML-embedded Product Data challenge at ISWC2020.

Task One: Product Matching

The product matching task aims to determine whether a pair of product offers from different websites refers to the same product.

Datasets

In the ISWC2020 challenge, the dataset for the Task One product matching task is sampled from the WDC product data corpus. Products in the corpus are described by the following properties: id, cluster id, category, title, description, brand, price, and specification table. Our models are trained mainly on two different matching datasets:

  • The Computers dataset, provided by the challenge organizers, contains only products from the Computers & Accessories category.

  • The All dataset contains products from all four categories (Computers & Accessories, Camera & Photo, Watches, and Shoes).
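
For illustration, a single training example can be thought of as a labeled pair of product records. The sketch below is hypothetical: the field names mirror the properties listed above, but the actual layout of the WDC pair files may differ.

    # Hypothetical shape of one labeled pair (field names mirror the
    # corpus properties; the real WDC file layout may differ).
    pair = {
        "label": 1,  # 1 = same product, 0 = different products
        "left": {
            "id": 42, "cluster_id": 7, "category": "Computers & Accessories",
            "title": "...", "description": "...", "brand": "...",
            "price": "...", "specification_table": "...",
        },
        "right": {
            # the same properties for the second offer
        },
    }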

Input

Although products are described by many attributes, most of the fields contain NULL values. Considering the fill rate and the input length limit, we focus on the title and description attributes and ignore the others.
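
A minimal sketch of how such a pair might be encoded as a BERT sequence pair with the Hugging Face transformers tokenizer; the field handling and maximum length here are assumptions and may differ from the actual preprocessing in train.py.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def encode_pair(left, right, max_length=128):
        # Join title and description for each offer, skipping NULL fields,
        # then encode the two offers as [CLS] left [SEP] right [SEP].
        text_a = " ".join(filter(None, [left.get("title"), left.get("description")]))
        text_b = " ".join(filter(None, [right.get("title"), right.get("description")]))
        return tokenizer(text_a, text_b, truncation=True,
                         max_length=max_length, padding="max_length",
                         return_tensors="pt")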

Model

We use BERT_base as the backbone of our matching model. Focal loss is adopted to alleviate the class imbalance problem.
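
For reference, a minimal binary focal loss in PyTorch (Lin et al., 2017); the alpha and gamma values shown are common defaults, not necessarily the ones used in train.py.

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        # targets: float tensor of 0./1. labels.
        # Down-weight well-classified examples so the loss focuses on the
        # hard (often minority-class) pairs.
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * bce).mean()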

Please download the dataset and BERT weights first.

Run train.py to train all the models we used in the challenge:

python train.py

After obtaining the model parameters, run predict.py to combine the predictions of the different models and produce the final results:

python predict.py
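
A sketch of one simple way to combine the models' outputs (probability averaging); predict.py may use a different combination scheme, and the names below are illustrative.

    import numpy as np

    def ensemble(prob_lists, threshold=0.5):
        # Average each pair's match probability across the models and
        # threshold to obtain the final 0/1 labels.
        avg = np.mean(prob_lists, axis=0)
        return (avg >= threshold).astype(int)

    # e.g. final = ensemble([probs_model_1, probs_model_2, probs_model_3])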

Post-processing

For test pairs predicted as matches (label 1) whose two products belong to different categories, we correct the prediction to 0 in the post-processing phase.
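
A minimal sketch of this rule; the function name and category arguments are illustrative.

    def post_process(pred, category_a, category_b):
        # Two offers from different categories cannot be the same product,
        # so flip a predicted match (1) back to a non-match (0).
        if pred == 1 and category_a != category_b:
            return 0
        return pred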

Results

Validation

Single model:

Model       Input              Dataset    F1      Post-processed F1
Bert_focal  title              All        0.9481  0.9496
Bert_focal  title+description  All        0.9384  0.9411
Bert_focal  title+description  Computers  0.9700  0.9700

Test

In the final evaluation, we ensemble the three models listed above:

Model      Precision  Recall  F1
Our model  0.8063     0.9200  0.8594
