MeLLL-UFF/bambas

bambas

Code for SemEval Task 4, Subtask 1.

Environment

Installation

First, clone the sklearn-hierarchical-classification repository:

git clone https://github.com/lfmatosm/sklearn-hierarchical-classification

With pip

pip install -r requirements.txt

With pipenv

pipenv shell
pipenv install

If pipenv reports errors while installing sklearn-hierarchical-classification, ignore them; the package is installed from the local clone afterwards.

After the previous steps, use pip to install the local repository:

pip install ../sklearn-hierarchical-classification # point to the cloned repository path

Running

For a working Google Colab example, please refer to this notebook.

For a quickstart, please refer to this shell script.

For a multilabel classification example, please refer to this notebook.

Fine-tuning for MLM

python -m src.fine_tuning \
  --model xlm-roberta-base \
  --dataset ptc2019 \
  --fine_tuned_name xlm-roberta-base-ptc2019 \
  --save_model

Fine-tuning + multilabel classification layer

python -m src.fine_tuning_with_class \
  --model jhu-clsp/bernice \
  --dataset semeval2024_dev_labeled \
  --fine_tuned_name jhu-clsp-bernice-semeval2024-dev-labeled-classifier \
  --batch_size 8 \
  --save_strategy epoch \
  --lr 3.9e-5 \
  --epochs 5 \
  --save_model

Feature extraction

Using the [CLS] token:

python -m src.feature_extraction \
  --model xlm-roberta-base \
  --dataset semeval2024 \
  --extraction_method cls

Or, if you want to use specific hidden layers:

python -m src.feature_extraction \
  --model xlm-roberta-base \
  --dataset semeval2024 \
  --extraction_method layers \
  --layers 4 5 6 7 \
  --agg_method "avg"
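
As a rough illustration of what averaging selected layers means (`--layers 4 5 6 7` with `--agg_method "avg"`), the sketch below averages per-layer vectors element-wise. It is not the repository's implementation: the function name `average_layers` and the plain-list input format are assumptions made for illustration, and whether the actual code pools the [CLS] vector or all tokens of each layer is not stated here.

```python
from statistics import fmean


def average_layers(hidden_states, layers):
    """Element-wise average of the vectors taken from the selected layers.

    hidden_states: one vector (list of floats) per layer for a single example
                   (assumed input format, for illustration only).
    layers: indices of the layers to combine, e.g. [4, 5, 6, 7].
    """
    if not layers:
        raise ValueError("at least one layer index is required")
    selected = [hidden_states[i] for i in layers]
    # zip(*selected) groups the k-th component of every selected layer together.
    return [fmean(components) for components in zip(*selected)]


# Toy example: 3 "layers" with 2-dimensional vectors.
toy_states = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0]]
print(average_layers(toy_states, [1, 2]))  # [2.0, 4.0]
```

The resulting vector has the same dimensionality as a single layer, so it can be fed to the same downstream classifier as a [CLS] feature.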

Or if you want to use sentence embeddings:

python -m src.feature_extraction \
  --model "sentence-transformers/stsb-xlm-r-multilingual" \
  --dataset semeval2024 \
  --extraction_method sentence
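
For context, sentence-embedding models commonly produce one vector per text by mean pooling the token vectors while ignoring padding positions. The sketch below shows that idea on plain lists. It is an assumption-laden illustration, not the code behind `--extraction_method sentence`: the helper name `mean_pool` and the input format are hypothetical, and the repository may simply delegate this step to the model library.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is truthy (non-padding).

    token_vectors: one vector (list of floats) per token.
    attention_mask: 1 for real tokens, 0 for padding, same length as token_vectors.
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    # Average each component across the kept tokens.
    return [sum(components) / len(kept) for components in zip(*kept)]


# Two real tokens followed by one padding token that must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
print(mean_pool(tokens, [1, 1, 0]))  # [2.0, 3.0]
```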

You can also specify a folder for saving the features:

python -m src.feature_extraction \
  --model "jhu-clsp/bernice" \
  --dataset semeval2024 \
  --extraction_method cls \
  --output_dir test_folder/

Classification

Using a Binary Relevance classifier. Note that this mode accepts a few optional arguments that control oversampling:

python -m src.classification \
  --classifier "LogisticRegression" \
  --dataset semeval2024 \
  --train_features "./feature_extraction/train_features.json" \
  --test_features "./feature_extraction/test_features.json" \
  --dev_features "./feature_extraction/dev_features.json" \
  --seed 1 \
  --oversampler Combination \
  --sample_strategy 1
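
Binary Relevance treats a multilabel problem as one independent binary problem per label. The sketch below shows only that decomposition. It is not the code in `src.classification`: the class names are hypothetical, oversampling is left out, and `MajorityClassifier` is a deliberately trivial stand-in for a real base learner.

```python
class MajorityClassifier:
    """Hypothetical stand-in base learner: predicts the most common training label."""

    def fit(self, X, y):
        # Ties are resolved in favour of the positive class.
        self.prediction_ = int(2 * sum(y) >= len(y))
        return self

    def predict(self, X):
        return [self.prediction_ for _ in X]


class BinaryRelevance:
    """Train one independent binary classifier per label column."""

    def __init__(self, make_classifier):
        # make_classifier: zero-argument callable returning a fresh fit/predict object.
        self.make_classifier = make_classifier

    def fit(self, X, Y):
        # Y holds one 0/1 indicator row per example; zip(*Y) yields one column per label.
        self.classifiers_ = [
            self.make_classifier().fit(X, list(column)) for column in zip(*Y)
        ]
        return self

    def predict(self, X):
        per_label = [clf.predict(X) for clf in self.classifiers_]
        # Transpose back to one row of label indicators per example.
        return [list(row) for row in zip(*per_label)]


X_train = [[0.0], [1.0], [2.0]]
Y_train = [[1, 0], [1, 0], [0, 1]]  # label 0 is mostly on, label 1 mostly off
model = BinaryRelevance(MajorityClassifier).fit(X_train, Y_train)
print(model.predict([[5.0], [6.0]]))  # [[1, 0], [1, 0]]
```

Any object whose `fit` returns the fitted estimator and that offers `predict` could be passed as the base learner in place of the stand-in.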

Using a multilabel feedforward classifier:

python -m src.classification \
  --classifier "MLP" \
  --dataset semeval2024 \
  --train_features "./feature_extraction/train_features.json" \
  --test_features "./feature_extraction/test_features.json" \
  --dev_features "./feature_extraction/dev_features.json" \
  --seed 1
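
To run feature extraction and classification back to back, the commands above can be assembled from Python. The helper below is a hypothetical convenience, not part of the repository; it only uses flags that appear in the examples above, and it assumes that extraction writes its files to `./feature_extraction/`, as the classification examples suggest.

```python
import sys


def build_command(module, **options):
    """Build an argv list for `python -m <module>` from keyword options."""
    argv = [sys.executable, "-m", module]
    for name, value in options.items():
        if value is False:
            continue  # disabled boolean flag: leave it out entirely
        argv.append(f"--{name}")
        if value is True:
            continue  # boolean flag such as --save_model takes no value
        if isinstance(value, (list, tuple)):
            argv.extend(str(item) for item in value)  # e.g. --layers 4 5 6 7
        else:
            argv.append(str(value))
    return argv


pipeline = [
    build_command(
        "src.feature_extraction",
        model="xlm-roberta-base",
        dataset="semeval2024",
        extraction_method="cls",
    ),
    build_command(
        "src.classification",
        classifier="MLP",
        dataset="semeval2024",
        train_features="./feature_extraction/train_features.json",
        test_features="./feature_extraction/test_features.json",
        dev_features="./feature_extraction/dev_features.json",
        seed=1,
    ),
]

for command in pipeline:
    print(" ".join(command))
    # To actually execute each step from the repository root:
    # import subprocess; subprocess.run(command, check=True)
```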