HannaKi/Paraphrase-detection-as-question-answering


Data preprocessing

To structure the Turku Paraphrase Corpus data files (train.json, dev.json, test.json and texts.json.gz) the same way as SQuAD, run make_paraphrase_data.py:

python3 make_paraphrase_data.py \
  --file dev.json \
  --context texts.json.gz \
  --output qa_data/dev.json
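
The output files follow the SQuAD v2 layout that the HuggingFace question-answering example expects. For orientation, a single record looks roughly like the sketch below; the field names follow the SQuAD v2 convention, while the values are placeholders (the exact mapping from paraphrase pairs to questions and answer spans is whatever make_paraphrase_data.py produces).

# Rough sketch of one SQuAD v2-style record (placeholder values).
record = {
    "title": "example document",
    "paragraphs": [
        {
            "context": "Document text fetched from texts.json.gz ...",
            "qas": [
                {
                    "id": "dev-0001",
                    "question": "Candidate paraphrase sentence ...",
                    "answers": [{"text": "matching span", "answer_start": 0}],
                    "is_impossible": False,  # True for pairs with no answer span
                }
            ],
        }
    ],
}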

Fine-tuning BERT

All of the original fine-tuning code can be found in the HuggingFace transformers repository; the code for this project was fetched from that repository on 7 June 2021.

To run the code on CSC Mahti:

Clear the loaded modules and load PyTorch:

module purge 
module load pytorch/1.8

Important! To run the HuggingFace example code, install transformers from source:

git clone https://github.com/huggingface/transformers
cd transformers
python -m pip install --user . 

To install the requirements:

python -m pip install --user -r requirements.txt

Notes:

  • This script only works with models that have a fast tokenizer (see the quick check after these notes).
  • If your dataset contains samples with no possible answer (like SQuAD version 2), you need to pass the flag --version_2_with_negative.
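
A quick way to verify the fast-tokenizer requirement before launching a run (a minimal sketch using the same model as the fine-tuning command below):

# Check that the model ships a fast (Rust-backed) tokenizer, which run_qa.py requires.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
print(tokenizer.is_fast)  # expected to print True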

Example command for fine-tuning BERT on the preprocessed Turku Paraphrase Corpus dataset:

python3 run_qa.py \
  --model_name_or_path TurkuNLP/bert-base-finnish-cased-v1 \
  --train_file train.json \
  --validation_file dev.json \
  --test_file test.json \
  --do_train \
  --do_eval \
  --do_predict \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 2 \
  --max_seq_length 512 \
  --doc_stride 128 \
  --version_2_with_negative \
  --output_dir /output/ \
  --cache_dir /caches/
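
Once training finishes, the checkpoint in the output directory can be queried with the transformers question-answering pipeline. The snippet below is a minimal sketch: the question and context strings are placeholders, and /output/ is assumed to contain the model saved by the command above.

# Minimal sketch of inference with the fine-tuned checkpoint.
from transformers import pipeline

qa = pipeline("question-answering", model="/output/", tokenizer="/output/")
prediction = qa(
    question="Candidate paraphrase sentence ...",                    # placeholder
    context="Document text the candidate is compared against ...",   # placeholder
    handle_impossible_answer=True,  # mirrors --version_2_with_negative
)
print(prediction)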
