Understanding Morphosyntactic Representations in Pretrained Language Models.

matteobrv/ma_thesis

Pretrained Language Models (PLMs) have revolutionized NLP, but their linguistic underpinnings still raise several questions. This thesis addresses some of them by investigating how well PLMs encode morphosyntactic information, focusing on tense and subject-verb agreement.

A novel probing method that leverages neural probes is developed to test the representations generated by three PLM architectures: BERT, RoBERTa, and Sentence Transformer. These PLMs are tested across three morphologically diverse languages: English, Italian, and German.

This repository hosts the code, data and results of my Master's thesis. For a more in-depth explanation of the research questions, data, methodology and findings, please refer to the thesis report.

Requirements

This project uses Poetry, a dependency management and packaging tool for Python. To install Poetry, follow the steps described at https://python-poetry.org/docs/#installation. Additionally, depending on your GPU, you may need to adjust the following line in pyproject.toml so that it points to the appropriate torch version for your setup:

torch = {file = "./torch-2.0.0+rocm5.4.2-cp310-cp310-linux_x86_64.whl"}

After installing Poetry and updating pyproject.toml, you can install the required dependencies and create a dedicated environment by running:

poetry install
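
Because the torch line above is tied to a specific GPU stack, it can help to confirm which build actually ended up in the environment. The following is a minimal sketch, not part of the repository: the function name torch_summary is hypothetical, and the script only reports what the installed torch package says about itself. It assumes nothing about the project code and can be run inside the Poetry environment.

```python
import importlib.util


def torch_summary():
    """Describe the installed torch build, or report that torch is missing.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    return {
        "installed": True,
        "version": torch.__version__,
        # True when torch can see a usable GPU (ROCm builds also report here).
        "gpu_available": torch.cuda.is_available(),
        # Set on ROCm builds, None otherwise.
        "hip": getattr(torch.version, "hip", None),
        # Set on CUDA builds, None otherwise.
        "cuda": getattr(torch.version, "cuda", None),
    }


if __name__ == "__main__":
    print(torch_summary())
```

If gpu_available is False on a machine with a GPU, the torch entry in pyproject.toml is a likely place to look first.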

Run the Experiments

The tense and agreement experiments for each language are configured in the corresponding JSON file:

  • it_experiments.json for Italian
  • en_experiments.json for English
  • de_experiments.json for German
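
To get a first look at one of these files before running anything, a small script along the following lines may be useful. It is only a sketch: the helper name summarize_config is hypothetical, and since the schema of the configuration files is not described here, the code deliberately assumes nothing about it and only reports the top-level structure.

```python
import json
from pathlib import Path


def summarize_config(path):
    """Report the top-level structure of a JSON file without assuming a schema.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    with Path(path).open(encoding="utf-8") as handle:
        config = json.load(handle)

    if isinstance(config, dict):
        return {"type": "object", "keys": sorted(config)}
    if isinstance(config, list):
        return {"type": "array", "length": len(config)}
    return {"type": type(config).__name__}


if __name__ == "__main__":
    # File names as listed above; files that are missing are skipped.
    for code in ("it", "en", "de"):
        config_path = Path(f"{code}_experiments.json")
        if config_path.exists():
            print(config_path, summarize_config(config_path))
```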

To run the experiments for a specific language, execute the following command:

python main.py [language_code]_experiments.json

Replace [language_code] with the desired language code (it, en, or de). This runs main.py with the specified JSON file, which contains the experiment's configuration and data.
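
To run the experiments for all three languages in sequence, the command above can be wrapped in a short script. This is a sketch under stated assumptions rather than a script shipped with the repository: build_command and run_all are hypothetical names, and it assumes the script is started from the repository root, inside the Poetry environment, so that main.py and the JSON files are found. By default it only prints the commands; pass --run to execute them.

```python
import subprocess
import sys

# Language codes listed in this README.
LANGUAGE_CODES = ("it", "en", "de")


def build_command(language_code, python=sys.executable):
    """Build the documented command line for one language."""
    if language_code not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code: {language_code!r}")
    return [python, "main.py", f"{language_code}_experiments.json"]


def run_all(dry_run=True):
    """Print (dry run) or execute the command for every language, in order."""
    commands = [build_command(code) for code in LANGUAGE_CODES]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failing experiment run.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    run_all(dry_run="--run" not in sys.argv)
```

Running the languages one after another keeps GPU memory use predictable; whether the experiments can safely run in parallel is not stated here, so the sketch does not attempt it.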