Awesome Semantic Textual Similarity (STS)

Awesome Semantic Textual Similarity: A Curated List of Semantic/Sentence Textual Similarity (STS) in Large Language Models and the NLP Field

License: MIT

This repository, called Awesome Semantic Textual Similarity, contains a collection of resources and papers on Semantic/Sentence Textual Similarity (STS) in Large Language Models and NLP.

"If you can't measure it, you can't improve it." - British Physicist William Thomson

Contributions are welcome: share your papers, thoughts, and ideas by submitting an issue!

Model Evolution Overview

Figure: model evolution overview before 2022 (Overview_before_2022.png)

Presentations

Sentence Textual Similarity: Model Evolution Overview
Shuyue Jia, Dependable Computing Laboratory, Boston University
[Link]
Oct 2023

Benchmarks

Please check here and here to download all the benchmark datasets below.

STS

STS12:
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre
SemEval 2012, [Paper] [Download]
07 June 2012

STS13:
*SEM 2013 shared task: Semantic Textual Similarity
Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo
*SEM 2013, [Paper] [Download]
13 June 2013

STS14:
SemEval-2014 Task 10: Multilingual Semantic Textual Similarity
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, Janyce Wiebe
SemEval 2014, [Paper] [Download]
23 Aug 2014

STS15:
SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, Janyce Wiebe
SemEval 2015, [Paper] [Download]
04 June 2015

STS16:
SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation
Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, Janyce Wiebe
SemEval 2016, [Paper] [Download]
16 June 2016

STS Benchmark (STSb):
SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, Lucia Specia
SemEval 2017, [Paper] [Download]
03 Aug 2017

SICK-Relatedness

A SICK Cure for the Evaluation of Compositional Distributional Semantic Models
Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli
LREC 2014, [Paper] [Download]
26 May 2014

Papers

Baselines

GloVe: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher Manning
EMNLP 2014, [Paper] [GitHub]
25 Oct 2014

Skip-Thought Vectors
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler
NeurIPS 2015, [Paper] [GitHub]
22 Jun 2015

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, Antoine Bordes
EMNLP 2017, [Paper] [GitHub]
07 Sept 2017

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
NAACL-HLT 2019, [Paper] [GitHub]
24 May 2019

BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
ICLR 2020, [Paper] [GitHub]
24 Feb 2020

BLEURT: Learning Robust Metrics for Text Generation
Thibault Sellam, Dipanjan Das, Ankur Parikh
ACL 2020, [Paper] [GitHub]
05 July 2020

Dense Passage Retrieval for Open-Domain Question Answering
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih
EMNLP 2020, [Paper] [GitHub]
16 Nov 2020

Universal Sentence Encoder
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil
arXiv 2018, [Paper] [GitHub]
12 Apr 2018

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers, Iryna Gurevych
EMNLP 2019, [Paper] [GitHub]
27 Aug 2019

Matrix-based Methods

Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement
Hua He, Jimmy Lin
NAACL 2016, [Paper]
12 June 2016

Text Matching as Image Recognition
Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, Xueqi Cheng
AAAI 2016, [Paper] [GitHub]
20 Feb 2016

MultiGranCNN: An Architecture for General Matching of Text Chunks on Multiple Levels of Granularity
Wenpeng Yin, Hinrich Schütze
ACL-IJCNLP 2015, [Paper]
26 July 2015

Alignment-based Methods

Attention Mechanism

Simple and Effective Text Matching with Richer Alignment Features
Runqi Yang, Jianhai Zhang, Xing Gao, Feng Ji, Haiqing Chen
ACL 2019, [Paper] [GitHub]
01 Aug 2019

Semantic Sentence Matching with Densely-Connected Recurrent and Co-Attentive Information
Seonhoon Kim, Inho Kang, Nojun Kwak
AAAI 2019, [Paper] [GitHub (Unofficial)]
27 January 2019

Multiway Attention Networks for Modeling Sentence Pairs
Chuanqi Tan, Furu Wei, Wenhui Wang, Weifeng Lv, Ming Zhou
IJCAI 2018, [Paper] [GitHub]
13 July 2018

Natural Language Inference over Interaction Space
Yichen Gong, Heng Luo, Jian Zhang
ICLR 2018, [Paper] [GitHub]
13 Sep 2017

Inter-Weighted Alignment Network for Sentence Pair Modeling
Gehui Shen, Yunlun Yang, Zhi-Hong Deng
EMNLP 2017, [Paper]
07 Sept 2017

Bidirectional Attention Flow for Machine Comprehension
Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, Hannaneh Hajishirzi
ICLR 2017, [Paper] [Webpage] [GitHub]
24 Apr 2017

A Structured Self-attentive Sentence Embedding
Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, Yoshua Bengio
ICLR 2017, [Paper] [GitHub]
09 Mar 2017

Sentence Similarity Learning by Lexical Decomposition and Composition
Zhiguo Wang, Haitao Mi, Abraham Ittycheriah
COLING 2016, [Paper] [GitHub]
11 Dec 2016

A Decomposable Attention Model for Natural Language Inference
Ankur Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit
EMNLP 2016, [Paper] [GitHub]
01 Nov 2016

Reasoning about Entailment with Neural Attention
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Phil Blunsom
ICLR 2016, [Paper] [GitHub]
01 Mar 2016

Traditional Methods

DLS@CU: Sentence Similarity from Word Alignment and Semantic Vector Composition
Md Arafat Sultan, Steven Bethard, Tamara Sumner
SemEval 2015, [Paper]
04 June 2015

Back to Basics for Monolingual Alignment: Exploiting Word Similarity and Contextual Evidence
Md Arafat Sultan, Steven Bethard, Tamara Sumner
TACL 2014, [Paper]
01 May 2014

Word Distance-based Methods

Improving Word Mover’s Distance by Leveraging Self-attention Matrix
Hiroaki Yamagiwa, Sho Yokoi, Hidetoshi Shimodaira
EMNLP 2023 Findings, [Paper] [GitHub]
02 Nov 2023

Toward Interpretable Semantic Textual Similarity via Optimal Transport-based Contrastive Sentence Learning
Seonghyeon Lee, Dongha Lee, Seongbo Jang, Hwanjo Yu
ACL 2022, [Paper] [GitHub]
22 May 2022

Word Rotator’s Distance
Sho Yokoi, Ryo Takahashi, Reina Akama, Jun Suzuki, Kentaro Inui
EMNLP 2020, [Paper] [GitHub]
16 Nov 2020

MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, Steffen Eger
EMNLP 2019, [Paper] [GitHub]
03 Nov 2019

From Word Embeddings To Document Distances
Matt Kusner, Yu Sun, Nicholas Kolkin, Kilian Weinberger
ICML 2015, [Paper] [GitHub]
06 July 2015

Sentence Embedding-based Methods

Paragraph Vector-based Methods

Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline
Kawin Ethayarajh
RepL4NLP 2018, [Paper] [GitHub]
20 July 2018

An Efficient Framework for Learning Sentence Representations
Lajanugen Logeswaran, Honglak Lee
ICLR 2018, [Paper] [GitHub]
30 Apr 2018

Universal Sentence Encoder
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil
arXiv 2018, [Paper] [GitHub]
12 Apr 2018

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, Antoine Bordes
EMNLP 2017, [Paper] [GitHub]
07 Sept 2017

A Simple but Tough-to-Beat Baseline for Sentence Embeddings
Sanjeev Arora, Yingyu Liang, Tengyu Ma
ICLR 2017, [Paper] [GitHub]
06 Feb 2017

Learning Distributed Representations of Sentences from Unlabelled Data
Felix Hill, Kyunghyun Cho, Anna Korhonen
NAACL 2016, [Paper] [GitHub (Unofficial)]
12 Jun 2016

Skip-Thought Vectors
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler
NeurIPS 2015, [Paper] [GitHub]
22 Jun 2015

Distributed Representations of Sentences and Documents
Quoc V. Le, Tomas Mikolov
ICML 2014, [Paper]
21 June 2014

Pretraining-finetuning Paradigm

Whitening Sentence Representations for Better Semantics and Faster Retrieval
Jianlin Su, Jiarun Cao, Weijie Liu, Yangyiwen Ou
arXiv 2021, [Paper] [GitHub (TensorFlow)] [GitHub (PyTorch)]
29 Mar 2021

On the Sentence Embeddings from Pre-trained Language Models
Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, Lei Li
EMNLP 2020, [Paper] [GitHub]
02 Nov 2020

SBERT-WK: A Sentence Embedding Method by Dissecting BERT-based Word Models
Bin Wang, C.-C. Jay Kuo
IEEE/ACM T-ASLP, [Paper] [GitHub]
29 July 2020

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers, Iryna Gurevych
EMNLP 2019, [Paper] [GitHub]
27 Aug 2019

BERT-based Scores

BLEURT: Learning Robust Metrics for Text Generation
Thibault Sellam, Dipanjan Das, Ankur Parikh
ACL 2020, [Paper] [GitHub]
05 July 2020

BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
ICLR 2020, [Paper] [GitHub]
24 Feb 2020

Contrastive Learning Framework

Toward Interpretable Semantic Textual Similarity via Optimal Transport-based Contrastive Sentence Learning
Seonghyeon Lee, Dongha Lee, Seongbo Jang, Hwanjo Yu
ACL 2022, [Paper] [GitHub]
22 May 2022

SimCSE: Simple Contrastive Learning of Sentence Embeddings
Tianyu Gao, Xingcheng Yao, Danqi Chen
EMNLP 2021, [Paper] [GitHub]
03 Jun 2021

Self-Guided Contrastive Learning for BERT Sentence Representations
Taeuk Kim, Kang Min Yoo, Sang-goo Lee
ACL 2021, [Paper] [GitHub]
03 Jun 2021

ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer
Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, Weiran Xu
ACL 2021, [Paper] [GitHub]
25 May 2021

Semantic Re-tuning with Contrastive Tension
Fredrik Carlsson, Amaru Cuba Gyllensten, Evangelia Gogoulou, Erik Ylipää Hellqvist, Magnus Sahlgren
ICLR 2021, [Paper] [GitHub]
03 May 2021

CLEAR: Contrastive Learning for Sentence Representation
Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, Hao Ma
arXiv 2020, [Paper]
31 Dec 2020

Distance Measurement

Evolution of Semantic Similarity - A Survey
Dhivya Chandrasekaran, Vijay Mago
ACM Computing Surveys 2021, [Paper]
18 Feb 2021

Distributional Measures of Semantic Distance: A Survey
Saif M. Mohammad, Graeme Hirst
arXiv 2012, [Paper]
08 Mar 2012

Evaluation Metrics

Pearson Correlation

Pearson Linear Correlation Coefficient: measures prediction accuracy, i.e. how closely the model's predictions follow a linear relationship with the gold labels

$$r=\frac{ \sum\nolimits_{i=1}^n \left( s_i-\bar{s} \right) \left( q_i-\bar{q} \right) }{\sqrt{ \sum\nolimits_{i=1}^n \left( s_i-\bar{s} \right)^2 } \sqrt{ \sum\nolimits_{i=1}^n \left( q_i-\bar{q} \right)^2 }},$$

where $s_i$ and $q_i$ are the gold label and the model’s prediction for the $i$-th sentence pair, $\bar{s}$ and $\bar{q}$ are the mean values of $\textbf{s}$ and $\textbf{q}$, and $n$ is the number of sentence pairs.
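As a rough illustration of the formula above, here is a minimal pure-Python sketch. The function name `pearson` and the example scores are hypothetical and not part of this repository; in practice a library routine such as `scipy.stats.pearsonr` would usually be preferred.

```python
import math


def pearson(gold, pred):
    """Pearson linear correlation r between gold labels s and predictions q."""
    n = len(gold)
    if n != len(pred) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    s_bar = sum(gold) / n  # mean of the gold labels
    q_bar = sum(pred) / n  # mean of the predictions
    # Numerator: sum of products of the centered values.
    cov = sum((s - s_bar) * (q - q_bar) for s, q in zip(gold, pred))
    # Denominator terms: sums of squared deviations.
    var_s = sum((s - s_bar) ** 2 for s in gold)
    var_q = sum((q - q_bar) ** 2 for q in pred)
    if var_s == 0 or var_q == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (math.sqrt(var_s) * math.sqrt(var_q))


# Hypothetical data: gold labels on a 0-5 scale and model scores on another scale.
gold = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
pred = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]
r = pearson(gold, pred)
print(f"Pearson r = {r:.4f}")
```

Because $r$ is invariant to positive linear rescaling, predictions do not need to be on the same scale as the gold labels, which is why cosine similarities can be compared directly against 0-5 annotations.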

Spearman Rank Correlation

Spearman’s Rank-order Correlation Coefficient: measures prediction monotonicity, i.e. how well the ranking of the model's predictions agrees with the ranking of the gold labels

$$\rho=1-\frac{6 \sum\nolimits_{i=1}^{n} d_i^2 }{n\left(n^2-1\right)},$$

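where $d_i$ is the difference between the rank of the $i$-th sentence pair in the model’s predictions and its rank in the gold labels, and $n$ is the number of sentence pairs. This closed form assumes there are no tied ranks; when ties occur, $\rho$ is computed as the Pearson correlation of the (average) ranks.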
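The closed-form expression can be sketched in a few lines of pure Python. The helper names `ranks` and `spearman_no_ties` and the example scores are hypothetical, and the sketch deliberately rejects tied values, since the closed form is exact only without ties; a library routine such as `scipy.stats.spearmanr` handles ties.

```python
def ranks(values):
    """Return 1-based ranks (smallest value gets rank 1); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_no_ties(gold, pred):
    """Spearman's rho via 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(gold)
    if n != len(pred) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    if len(set(gold)) != n or len(set(pred)) != n:
        raise ValueError("the closed form assumes no tied values")
    # d_i is the rank difference of the i-th sentence pair.
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(gold), ranks(pred)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: four sentence pairs with gold labels and model scores.
gold = [4.2, 1.0, 3.5, 0.5]  # ranks: 4, 2, 3, 1
pred = [0.9, 0.2, 0.4, 0.3]  # ranks: 4, 1, 3, 2
rho = spearman_no_ties(gold, pred)
print(f"Spearman rho = {rho:.4f}")  # sum(d_i^2) = 2, so rho = 1 - 12/60 = 0.8
```

Only the ordering of the scores matters here: any strictly increasing transformation of the predictions leaves $\rho$ unchanged.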

Citation

If you find our list useful, please consider citing our repo and toolkit in your publications. BibTeX entries are provided below.

@misc{JiaAwesomeSTS23,
      author = {Jia, Shuyue},
      title = {Awesome Semantic Textual Similarity},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub Repository},
      howpublished = {\url{https://github.com/SuperBruceJia/Awesome-Semantic-Textual-Similarity}},
}

@misc{JiaAwesomeLLM23,
      author = {Jia, Shuyue},
      title = {Awesome {LLM} Self-Consistency},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub Repository},
      howpublished = {\url{https://github.com/SuperBruceJia/Awesome-LLM-Self-Consistency}},
}

@misc{JiaPromptCraft23,
      author = {Jia, Shuyue},
      title = {{PromptCraft}: A Prompt Perturbation Toolkit},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub Repository},
      howpublished = {\url{https://github.com/SuperBruceJia/promptcraft}},
}