
EASE: Entity-Aware Contrastive Learning of Sentence Embedding


EASE is a novel method for learning sentence embeddings via contrastive learning between sentences and their related entities, proposed in our paper EASE: Entity-Aware Contrastive Learning of Sentence Embedding. This repository contains the source code to train the model and evaluate it on downstream tasks. Our code is largely based on that of SimCSE.

Released Models


Our published models are listed below. You can use them with Hugging Face Transformers.

Monolingual Models                        Avg. STS   Avg. STC
sosuke/ease-bert-base-uncased             77.0       63.1
sosuke/ease-roberta-base                  76.8       58.6

Multilingual Models                       Avg. mSTS  Avg. mSTC
sosuke/ease-bert-base-multilingual-cased  57.2       36.1
sosuke/ease-xlm-roberta-base              57.1       36.3

Use EASE with Hugging Face Transformers

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Load the pretrained EASE model and tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("sosuke/ease-bert-base-multilingual-cased")
model = AutoModel.from_pretrained("sosuke/ease-bert-base-multilingual-cased")

# Mean pooling: average the token embeddings over non-padding positions.
pooler = lambda last_hidden, att_mask: (last_hidden * att_mask.unsqueeze(-1)).sum(1) / att_mask.sum(-1).unsqueeze(-1)

# Tokenize input texts.
texts = [
    "Ils se préparent pour un spectacle à l'école.",
    "They are preparing for a show at school.",
    "Two medical professionals in green look on at something."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    last_hidden = model(**inputs, output_hidden_states=True, return_dict=True).last_hidden_state
embeddings = pooler(last_hidden, inputs["attention_mask"])

# Calculate cosine similarities
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print(f"Cosine similarity between {texts[0]} and {texts[1]} is {cosine_sim_0_1}")
print(f"Cosine similarity between {texts[0]} and {texts[2]} is {cosine_sim_0_2}")

Please see here for other pooling methods.
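For example, CLS-token pooling (one common alternative to mean pooling) can be plugged in the same way. The snippet below is a minimal illustrative sketch that reuses last_hidden and inputs from the example above; it is not tied to the repository's --pooler implementations.

# CLS pooling: take the hidden state of the first ([CLS]) token as the sentence embedding.
cls_pooler = lambda last_hidden, att_mask: last_hidden[:, 0]
cls_embeddings = cls_pooler(last_hidden, inputs["attention_mask"])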

Setups

Python

Run the following command to install the required libraries.

pip install -r requirements.txt

Before training, please download the datasets for training and evaluation.

bash download_all.sh

Evaluation

We provide evaluation code for sentence embeddings covering Semantic Textual Similarity (STS 2012-2016, STS Benchmark, SICK-Relatedness, and the extended version of the STS 2017 dataset), Short Text Clustering (eight STC benchmarks and MewsC-16), Cross-lingual Parallel Matching (Tatoeba), and Cross-lingual Text Classification (MLDoc).

Set your model name or the path to a Transformers-based checkpoint (--model_name_or_path), the pooling method (--pooler), and the task set (--task_set). See the example commands below.

Semantic Textual Similarity

python evaluation.py \
    --model_name_or_path sosuke/ease-bert-base-multilingual-cased \
    --pooler avg \
    --task_set cl-sts

Short Text Clustering

python downstreams/text-clustering/evaluation.py \
    --model_name_or_path sosuke/ease-bert-base-multilingual-cased \
    --pooler avg \
    --task_set cl

Cross-lingual Parallel Matching

python downstreams/parallel-matching/evaluation.py \
    --model_name_or_path sosuke/ease-bert-base-multilingual-cased \
    --pooler avg

Cross-lingual Text Classification

python downstreams/cross-lingual-transfer/evaluation.py \
    --model_name_or_path sosuke/ease-bert-base-multilingual-cased \
    --pooler avg

Please refer to each evaluation script for a detailed description of its arguments.
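As a rough illustration of what the cross-lingual parallel matching task measures, retrieval can be scored by checking whether each source sentence's nearest target sentence (by cosine similarity) is its gold translation. This is a hypothetical sketch, not the repository's evaluation code; the function name matching_accuracy is made up for illustration.

import torch
import torch.nn.functional as F

def matching_accuracy(src_emb: torch.Tensor, tgt_emb: torch.Tensor) -> float:
    # src_emb and tgt_emb are (N, d) tensors; row i of each side forms a translation pair.
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    sims = src @ tgt.T                     # (N, N) cosine similarity matrix
    predictions = sims.argmax(dim=-1)      # nearest target index for each source sentence
    gold = torch.arange(src.size(0))
    return (predictions == gold).float().mean().item()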

Training

You can train an EASE model in a monolingual setting using English Wikipedia sentences or in a multilingual setting using Wikipedia sentences in 18 languages.

We provide example training scripts for both the monolingual (train_monolingual_ease.sh) and multilingual (train_multilingual_ease.sh) settings.
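For intuition, the core idea of the entity-aware contrastive objective can be sketched as an in-batch InfoNCE loss that pulls each sentence embedding toward the embedding of its related entity and pushes it away from the entities of the other sentences in the batch. The sketch below is hypothetical and greatly simplified; the actual training code in this repository includes additional components not shown here, and the temperature value is only illustrative.

import torch
import torch.nn.functional as F

def entity_contrastive_loss(sent_emb: torch.Tensor,
                            entity_emb: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
    # sent_emb:   (B, d) sentence embeddings
    # entity_emb: (B, d) embeddings of each sentence's related entity
    sent = F.normalize(sent_emb, dim=-1)
    ent = F.normalize(entity_emb, dim=-1)
    logits = sent @ ent.T / temperature    # (B, B) cosine similarities scaled by temperature
    labels = torch.arange(sent.size(0))    # the matching entity sits on the diagonal
    return F.cross_entropy(logits, labels)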

MewsC-16

We constructed MewsC-16 (Multilingual Short Text Clustering Dataset for News in 16 languages) from Wikinews. The dataset contains topic sentences from Wikinews articles covering 13 categories and 16 languages. More detailed information is available in Appendix E of our paper.

Statistics and Scores
Language  Sentences  Label types  XLM-R (base)  EASE-XLM-R (base)
ar 2,224 11 27.9 27.4
ca 3,310 11 27.1 27.9
cs 1,534 9 25.2 41.2
de 6,398 8 30.5 39.5
en 12,892 13 25.8 39.6
eo 227 8 24.7 37.0
es 6,415 11 20.8 38.2
fa 773 9 37.2 41.5
fr 10,697 13 25.3 33.3
ja 1,984 12 44.0 47.6
ko 344 10 24.1 33.7
pl 7,247 11 28.8 39.9
pt 8,921 11 27.4 32.9
ru 1,406 12 20.1 27.2
sv 584 7 30.1 29.8
tr 459 7 30.7 44.9
Avg. 28.1 36.3

Note that these results differ slightly from those reported in the original paper because we further cleaned the data after publication.
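For reference, the Avg. row is consistent with an unweighted (macro) mean over the 16 languages. The quick check below uses the rounded per-language scores from the table above, so the recomputed means can differ from the reported values in the last digit.

# Recompute macro averages from the (rounded) per-language scores in the table above.
xlmr = [27.9, 27.1, 25.2, 30.5, 25.8, 24.7, 20.8, 37.2, 25.3, 44.0, 24.1, 28.8, 27.4, 20.1, 30.1, 30.7]
ease = [27.4, 27.9, 41.2, 39.5, 39.6, 37.0, 38.2, 41.5, 33.3, 47.6, 33.7, 39.9, 32.9, 27.2, 29.8, 44.9]
print(sum(xlmr) / len(xlmr))  # about 28.1 (reported Avg.: 28.1)
print(sum(ease) / len(ease))  # about 36.35 (reported Avg.: 36.3)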

Citation


@inproceedings{nishikawa-etal-2022-ease,
    title = "{EASE}: Entity-Aware Contrastive Learning of Sentence Embedding",
    author = "Nishikawa, Sosuke  and
      Ri, Ryokan  and
      Yamada, Ikuya  and
      Tsuruoka, Yoshimasa  and
      Echizen, Isao",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.284",
    pages = "3870--3885",
    abstract = "We present EASE, a novel method for learning sentence embeddings via contrastive learning between sentences and their related entities. The advantage of using entity supervision is twofold: (1) entities have been shown to be a strong indicator of text semantics and thus should provide rich training signals for sentence embeddings; (2) entities are defined independently of languages and thus offer useful cross-lingual alignment supervision. We evaluate EASE against other unsupervised models both in monolingual and multilingual settings. We show that EASE exhibits competitive or better performance in English semantic textual similarity (STS) and short text clustering (STC) tasks and it significantly outperforms baseline methods in multilingual settings on a variety of tasks. Our source code, pre-trained models, and newly constructed multi-lingual STC dataset are available at https://github.com/studio-ousia/ease.",
}