
CollaborEM: A Self-supervised Entity Matching Framework Using Multi-features Collaboration

CollaborEM is a self-supervised entity matching framework based on multi-feature collaboration. It is capable of (i) obtaining reliable entity resolution (ER) results with zero human annotations and (ii) discovering adequate tuple features in a fault-tolerant manner. CollaborEM consists of two phases, i.e., automatic label generation (ALG) and collaborative EM training (CEMT). In the first phase, ALG generates a set of positive tuple pairs and a set of negative tuple pairs. ALG guarantees the high quality of the generated pairs and hence ensures the training quality of the subsequent CEMT. In the second phase, CEMT learns the matching signals by collaboratively discovering the graph features and sentence features of tuples.

For more technical details, see our TKDE 2021 paper CollaborEM: A Self-supervised Entity Matching Framework Using Multi-features Collaboration.

(Figure: overview of the CollaborEM framework.)
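
As a rough illustration of the two-phase pipeline, here is a toy Python sketch. The function names, the similarity-threshold heuristic, and the example pairs are illustrative assumptions only, not the repository's actual implementation.

def automatic_label_generation(candidate_pairs):
    """Phase 1 (ALG): derive high-confidence positive and negative
    training pairs with zero human annotations. A simple similarity
    threshold stands in here for ALG's actual quality guarantees."""
    positives = [p for p in candidate_pairs if p["similarity"] >= 0.9]
    negatives = [p for p in candidate_pairs if p["similarity"] <= 0.1]
    return positives, negatives

def collaborative_em_training(positives, negatives):
    """Phase 2 (CEMT): learn matching signals from graph and sentence
    features of tuples collaboratively (placeholder body)."""
    print(f"Training on {len(positives)} positive / {len(negatives)} negative pairs")

pairs = [
    {"left": "Apple iPhone 12 64GB", "right": "iPhone 12 (64 GB)", "similarity": 0.95},
    {"left": "Apple iPhone 12 64GB", "right": "Galaxy S21 128GB", "similarity": 0.05},
]
pos, neg = automatic_label_generation(pairs)
collaborative_em_training(pos, neg)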

Requirements

  • Python 3.7
  • PyTorch 1.7.1
  • CUDA 11.0
  • HuggingFace Transformers 4.4.2
  • Sentence Transformers 1.0.4
  • NVIDIA Apex (fp16 training)
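
A quick way to verify the environment (a minimal sketch; it assumes the requirements above were installed from their standard pip distributions):

import torch
import transformers
import sentence_transformers

print("PyTorch:", torch.__version__)                 # expect 1.7.1
print("Transformers:", transformers.__version__)     # expect 4.4.2
print("Sentence-Transformers:", sentence_transformers.__version__)  # expect 1.0.4
print("CUDA available:", torch.cuda.is_available())  # expect True under CUDA 11.0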

① Download er.tar.gz; we recommend using conda-pack to reproduce the environment:

pip install conda-pack
mkdir -p er
tar -xzf er.tar.gz -C er
# Run Python directly, without activating the environment:
./er/bin/python
# Or activate the environment in the current shell:
source er/bin/activate

② Download and unzip lm_model.
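
To check that the unzipped model is readable, a minimal sketch (it assumes lm_model unpacks to a HuggingFace-compatible directory named lm_model; adjust the path to wherever you unzip it):

from transformers import AutoModel, AutoTokenizer

# Both calls read from the local directory; no network access is needed.
tokenizer = AutoTokenizer.from_pretrained("lm_model")
model = AutoModel.from_pretrained("lm_model")
print("Loaded language model of type:", model.config.model_type)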

Datasets

We conduct experiments on eight representative and widely used EM benchmarks of different sizes and from various domains, taken from the DeepMatcher paper.

The dataset configurations can be found in configs.json.
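
To list the configured datasets, a minimal sketch (it only assumes configs.json is a JSON array whose entries carry a "name" field, which may differ from the actual schema):

import json

with open("configs.json") as f:
    configs = json.load(f)

for cfg in configs:
    print(cfg["name"])  # assumes each entry has a "name" field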

Training with CollaborEM

Download and unzip the preprocessed data.

To train the matching model with CollaborEM:

python run_all.py

You can download checkpoints here.
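
To inspect a downloaded checkpoint without a GPU, a minimal sketch (the filename checkpoint.pt is hypothetical; substitute the actual file from the download):

import torch

# map_location="cpu" lets the checkpoint load on machines without CUDA.
state = torch.load("checkpoint.pt", map_location="cpu")
print(type(state))  # typically a state_dict (an OrderedDict of tensors)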

Acknowledgement

Our implementation builds on code from DITTO and AttrGNN.
