semEval2024: Semantic Textual Relatedness - Pinealai Code

Description Task 1 English

A shared task on automatically detecting the degree of semantic relatedness between pairs of sentences. New textual datasets will be provided for Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu.

Link to the shared task: SemEval2024

Main objective

Our main goal was to identify features for predicting semantic textual relatedness in English, along with explainable ways of measuring the impact of those features.

Brief Methodology

In this study, the objective is to predict the semantic textual relatedness between two texts. We made two key design decisions:

We did not preprocess the corpus, in order to preserve sentence structure, which is essential for information retrieval and semantic identification (Hirst, 1987);
We intentionally did not fine-tune Large Language Models (LLMs) in our experiments, because their complexity and contextual abilities would make it hard to interpret which specific features contribute to semantic identification.

The diagram of our method is available in Method-STR.png.

How to use the Repo?

You don't need to run everything. If the environment is set up correctly, you can run synt.py to print the models' performance in your terminal; it automatically uses the pre-generated features stored in the repo. The code is quite simple to understand: we did not create any fancy classes, and we added some comments too.

How to reproduce the experiment and improve it?

Requirements

To reproduce the work, create a virtual environment and install the dependencies we used. The main ones are scikit-learn, spacy, nltk and sentence_transformers, but it is better to install everything from the requirements.txt in the repo folder. For the Jupyter Notebook code, the required dependencies should be easy to identify.

NB: The scripts were run under Linux (Ubuntu 22.04).

Set up of the environment

First, clone the repo locally:

git clone https://github.com/Anvi98/semEval2024_code.git

Create the virtual environment and activate it (run these commands from inside the cloned folder):

python3 -m venv env 
source env/bin/activate

Then install the dependencies (use 'pip' if your setup does not provide 'pip3'):

pip3 install -r requirements.txt
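Once the install finishes, a quick way to confirm that the main dependencies named under Requirements are visible to the active environment is sketched below. This is not part of the repo: the function name check_dependencies and the mapping are illustrative assumptions, and the only library-specific detail relied on is that scikit-learn is imported as sklearn.

```python
from importlib.util import find_spec

# Package name -> import name for the main dependencies listed in this README.
# Only scikit-learn differs: it is imported as "sklearn".
MAIN_DEPENDENCIES = {
    "scikit-learn": "sklearn",
    "spacy": "spacy",
    "nltk": "nltk",
    "sentence_transformers": "sentence_transformers",
}


def check_dependencies(modules=MAIN_DEPENDENCIES):
    """Return {package name: bool}, True when the module can be located.

    Hypothetical helper, not part of the repo. find_spec only locates the
    module; it does not import it, so the check stays fast.
    """
    return {pkg: find_spec(mod) is not None for pkg, mod in modules.items()}


if __name__ == "__main__":
    for pkg, found in check_dependencies().items():
        print(f"{pkg}: {'found' if found else 'MISSING'}")
```

Any package reported as MISSING usually means the virtual environment was not activated before running pip.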

Start working on these files and improve them

After installing the dependencies, everything is set. You will be able to run these files:

embed.py (extracts BERT embeddings)
fuzzy_h.py (computes the Levenshtein distance for each pair of sentences)
syntactic_features.py (extracts syntactic features for each pair of sentences)
synt.py (trains traditional ML models and runs prediction)
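To give an idea of the kind of feature fuzzy_h.py produces, here is a minimal, dependency-free sketch of a Levenshtein distance between two sentences. It does not reproduce the script itself, which may rely on a library and a different scaling; the function names and the normalisation to a 0-1 score are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s1: str, s2: str) -> float:
    """Hypothetical normalisation: 1.0 for identical strings, 0.0 when every
    character of the longer string has to be edited."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s1, s2) / longest


if __name__ == "__main__":
    pair = ("A man is playing a guitar.", "A man plays the guitar.")
    print(levenshtein(*pair), round(levenshtein_similarity(*pair), 3))
```

A character-level score like this one captures surface overlap only, which is why it is combined with embedding and syntactic features rather than used on its own.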

Authors:

Anvi Alex Eponon
Luis Ramos
