GitHub - matteobrv/ma_thesis: Understanding Morphosyntactic Representations in Pretrained Language Models.

Understanding Morphosyntactic Representations in Pretrained Language Models

Pretrained Language Models (PLMs) have revolutionized NLP, but it remains unclear what linguistic knowledge they actually capture. This thesis addresses that question by investigating PLMs' ability to encode morphosyntactic information, focusing on tense and subject-verb agreement.

A novel probing method that leverages neural probes is developed to test the representations generated by three PLM architectures: BERT, RoBERTa, and Sentence Transformer. These PLMs are tested across three morphologically diverse languages: English, Italian, and German.

This repository hosts the code, data and results of my Master's thesis. For a more in-depth explanation of the research questions, data, methodology and findings, please refer to the thesis report.

Requirements

This project uses Poetry, a dependency management and packaging tool for Python. To install Poetry, follow the steps described at https://python-poetry.org/docs/#installation. Depending on your GPU, you may also need to adjust the following line in pyproject.toml so that it points to the torch build appropriate for your setup:

torch = {file = "./torch-2.0.0+rocm5.4.2-cp310-cp310-linux_x86_64.whl"}

After installing Poetry and updating pyproject.toml, install the required dependencies and create a dedicated environment by running:

poetry install
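
Once the installation finishes, it can be useful to confirm which torch build ended up in the environment. The following is a minimal sketch and not part of the repository; the function name describe_torch_install is hypothetical. It relies on the fact that ROCm builds of PyTorch report GPUs through the same torch.cuda interface as CUDA builds.

```python
import importlib.util


def describe_torch_install():
    """Summarize the installed torch build, or return None if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "version": torch.__version__,
        # Set on CUDA builds, None otherwise.
        "cuda": getattr(torch.version, "cuda", None),
        # Set on ROCm builds, None otherwise.
        "hip": getattr(torch.version, "hip", None),
        # True when a usable GPU is visible (CUDA or ROCm).
        "gpu_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_install())
```

Saved under a file name of your choice (for example check_torch.py, a hypothetical name), the script can be run inside the Poetry environment with poetry run python check_torch.py.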

Run the Experiments

The tense and agreement experiments for each language are configured in the corresponding JSON file:

it_experiments.json for Italian
en_experiments.json for English
de_experiments.json for German
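
To get a first impression of one of these files before running anything, a small helper can list its top-level entries. This is a sketch that makes no assumption about the schema used in the repository; the function name summarize_config is hypothetical.

```python
import json
from pathlib import Path


def summarize_config(path):
    """Map each top-level key of a JSON file to the type name of its value."""
    with Path(path).open(encoding="utf-8") as handle:
        config = json.load(handle)
    if isinstance(config, dict):
        return {key: type(value).__name__ for key, value in config.items()}
    # The root may also be a list or a scalar; report its type only.
    return {"<root>": type(config).__name__}
```

For example, summarize_config("en_experiments.json") returns a dictionary describing the structure of the English configuration, assuming it is called from the repository root.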

To run the experiments for a specific language, execute the following command:

python main.py [language_code]_experiments.json

Replace [language_code] with the desired language code (it, en, or de). This runs main.py with the specified JSON file, which contains the experiment's configuration and data.
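
To run all three languages one after another, the command above can be wrapped in a short script. This is a sketch under two assumptions: main.py takes the configuration file as its only positional argument, as shown above, and the script is started from the repository root inside the Poetry environment. The names build_command and run_all are hypothetical.

```python
import subprocess
import sys

# Language codes for which the repository provides a configuration file.
LANGUAGE_CODES = ("it", "en", "de")


def build_command(language_code):
    """Return the argument list that runs main.py for one language."""
    if language_code not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language code: {language_code!r}")
    return [sys.executable, "main.py", f"{language_code}_experiments.json"]


def run_all(language_codes=LANGUAGE_CODES):
    """Run the experiments sequentially, stopping at the first failure."""
    for code in language_codes:
        # check=True raises CalledProcessError if main.py exits with an error.
        subprocess.run(build_command(code), check=True)
```

Calling run_all() starts the Italian, English, and German experiments in that order; run_all(["en"]) restricts the run to a single language.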
