DIAL

Implementation of the paper "Deep Indexed Active Learning for Matching Heterogeneous Entity Representations" (arXiv:2104.03986).

Traditional methods for active learning (AL) in pairwise classification tasks follow this pipeline: in each iteration, the learning algorithm (learner) learns a matcher from 𝑇, the labeled pairs collected from the (human) labeler so far, while the example selector (selector) chooses the most informative unlabeled pairs to acquire labels for. After adding the new labels to 𝑇, the process repeats until we learn a matcher of sufficient quality.
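
To make this loop concrete, here is a minimal, self-contained sketch of uncertainty-sampling AL using scikit-learn (one of the dependencies installed below). The toy features and the logistic-regression matcher are illustrative stand-ins, not DIAL's TPLM matcher:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy pair features; the ground-truth labels stand in for the human labeler.
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

labeled = list(rng.choice(len(X), size=20, replace=False))  # seed set T
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

for _ in range(5):                                            # AL iterations
    matcher = LogisticRegression().fit(X[labeled], y[labeled])   # learner
    probs = matcher.predict_proba(X[unlabeled])[:, 1]
    picks = np.argsort(np.abs(probs - 0.5))[:10]  # selector: most uncertain pairs
    newly = [unlabeled[i] for i in picks]
    labeled += newly                              # labeler adds labels to T
    unlabeled = [i for i in unlabeled if i not in set(newly)]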

Our proposed integrated matcher-blocker combination yields a new AL workflow (sketched after the list below). Compared to the traditional pipeline, the two most notable differences are:

  1. the blocker is now part of the AL feedback loop, and
  2. the matcher is a component within the blocker. As the base matcher, we use transformer-based pretrained language models (TPLMs), which have recently achieved excellent entity resolution (ER) accuracy in the passive (non-AL) setting.
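
To make the blocker-in-the-loop idea concrete, here is a hedged sketch of index-based candidate retrieval with FAISS (also one of the dependencies below). The random vectors stand in for the matcher's TPLM embeddings of the two tables; the exact embedding and retrieval scheme DIAL uses is described in the paper:

import faiss
import numpy as np

d = 128  # embedding dimension (illustrative)
rng = np.random.default_rng(0)
# Stand-ins for TPLM embeddings of records from the two tables.
left = rng.normal(size=(500, d)).astype("float32")
right = rng.normal(size=(2000, d)).astype("float32")
faiss.normalize_L2(left)
faiss.normalize_L2(right)

index = faiss.IndexFlatIP(d)           # exact inner-product (cosine) index
index.add(right)                       # index one table's embeddings
scores, nbrs = index.search(left, 10)  # top-10 candidates per left record
# Candidate pairs (i, j) passed on to the selector, instead of all
# len(left) * len(right) pairs.
candidates = [(i, int(j)) for i in range(len(left)) for j in nbrs[i]]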

Getting Started

Environment

This code has been tested on a machine with 64 2.10 GHz Intel Xeon Silver 4216 CPUs, 1007 GB RAM, and a single NVIDIA Titan Xp 12 GB GPU with CUDA 10.2, running Ubuntu 18.04.

Reproducing the Experiments

The first step is to get the data. We provide the data used in the DeepMatcher experiments (Link1, Link2, Link3). The multilingual data can be downloaded from salesforce/localization-xml-mt:

cd MultiLingual
git clone https://github.com/salesforce/localization-xml-mt.git

Now create a virtual environment using conda

conda create -n DIAL_env
conda activate DIAL_env
conda install -y -c conda-forge -c pytorch pytorch==1.6 cudatoolkit=10.2
pip install faiss-cpu transformers scikit-learn pandas 
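
An optional sanity check (not part of the repository) that the key packages import and the GPU is visible:

# Run inside the DIAL_env environment.
import torch, faiss, transformers
print(torch.__version__)          # expect 1.6.x
print(torch.cuda.is_available())  # expect True with CUDA 10.2 set up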

Use run_single.sh to run DIAL. For example:

bash run_single.sh DIAL amazon_google_exp 

To evaluate on the Test set, run

bash run_eval.sh Eval-Test DIAL amazon_google_exp 

and to evaluate on All Pairs, run

bash run_eval.sh Eval-AllPairs DIAL amazon_google_exp 

Currently supported datasets: Walmart-Amazon, Amazon-Google, DBLP-ACM, DBLP-Google Scholar, Abt-Buy.

To run experiments with the multilingual dataset,

cd MultiLingual
bash run_multilingual_expts.sh DIAL-Multilingual

Citation

If you use this code for your research, please consider citing our arXiv preprint:

@misc{jain2021deep,
      title={Deep Indexed Active Learning for Matching Heterogeneous Entity Representations}, 
      author={Arjit Jain and Sunita Sarawagi and Prithviraj Sen},
      year={2021},
      eprint={2104.03986},
      archivePrefix={arXiv},
      primaryClass={cs.DB}
}

References:

  1. https://github.com/megagonlabs/ditto
  2. https://github.com/brunnurs/entity-matching-transformer
  3. https://github.com/anhaidgroup/deepmatcher
  4. https://github.com/JordanAsh/badge
