This repository contains the code and the PLODv2 dataset for training character-level language models (CLMs) for abbreviation and long-form detection, released with our LREC-COLING 2024 publication (coming soon).
```bash
git clone https://github.com/surrey-nlp/PLODv2-CLM4AbbrDetection.git
conda create -n abbrdet python=3.9
conda activate abbrdet
cd PLODv2-CLM4AbbrDetection
pip install -r requirements.txt
```
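As a quick optional sanity check (assuming requirements.txt installs Flair and PyTorch), the core dependencies should import cleanly:

```bash
python -c "import flair, torch; print(flair.__version__, torch.__version__)"
```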
Use our PLODv2 dataset for training abbreviation detection systems:
```bash
CUDA_VISIBLE_DEVICES=0 python -m src.train_ner \
    --bio_folder ./PLODv2/filtered_data \
    --embed_model '("glove", "news-forward", "news-backward")' \
    --save_folder ./stacked_glove_news_filtered \
    --learning_rate 0.01 \
    --mini_batch_size 32 \
    --max_epochs 150 \
    --use_transformer False
```
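For orientation, the command above corresponds roughly to the following Flair training loop. This is a minimal sketch rather than the exact src.train_ner implementation; the BIO column format and the tagger's hidden size are assumptions:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Assumed layout: train/dev/test files with the token in column 0
# and the BIO tag in column 1.
corpus = ColumnCorpus("./PLODv2/filtered_data", {0: "text", 1: "ner"})
label_dict = corpus.make_label_dictionary(label_type="ner")

# Stack GloVe word embeddings with forward/backward character-level LMs,
# mirroring --embed_model '("glove", "news-forward", "news-backward")'.
embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(
    hidden_size=256,  # assumed; not exposed by the CLI flags above
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "./stacked_glove_news_filtered",
    learning_rate=0.01,
    mini_batch_size=32,
    max_epochs=150,
)
```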
Train character-level language models from scratch or for continued pre-training:
```bash
CUDA_VISIBLE_DEVICES=0 python -m src.flair_clm \
    --continue_pretrain False \
    --corpus_path ./corpus \
    --is_forward True \
    --out plm \
    --learning_rate 20.0 \
    --mini_batch_size 100 \
    --max_epochs 300 \
    --sequence_length 256 \
    --hidden_size 2048
```
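Under the hood this relies on Flair's language-model trainer. The sketch below shows the from-scratch case (--continue_pretrain False); continued pre-training would instead load an existing LanguageModel and resume training on the new corpus. The corpus layout and the single LSTM layer are assumptions:

```python
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

is_forward_lm = True  # --is_forward True

# Default character dictionary shipped with Flair; a custom dictionary
# could instead be built from the training corpus.
dictionary = Dictionary.load("chars")

# Assumed layout: ./corpus/train/ with split files, plus valid.txt and test.txt.
corpus = TextCorpus("./corpus", dictionary, is_forward_lm, character_level=True)

language_model = LanguageModel(
    dictionary,
    is_forward_lm,
    hidden_size=2048,
    nlayers=1,  # assumed; one LSTM layer is Flair's usual choice for CLMs
)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train(
    "plm",
    sequence_length=256,
    mini_batch_size=100,
    max_epochs=300,
    learning_rate=20.0,
)
```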
Our fine-tuned models for abbreviation and long-form detection are listed in the table below.
In the table, the No. 1 series of models is fine-tuned on PLODv2 using RoBERTa-large; the No. 2 series is fine-tuned on PLODv2 using stacked character-level PubMed embeddings; and the No. 3 series is fine-tuned on PLODv2 by stacking RoBERTa-large with character-level language models initialized from the PubMed models and further pre-trained on PLOS.
To run (or fine-tune) Transformer models such as RoBERTa-large, see our Jupyter notebooks. To run inference with our fine-tuned stacked-embedding models via Flair:
```bash
CUDA_VISIBLE_DEVICES=0 python -m src.predict \
    --bio_folder ./PLODv2/filtered_data \
    --model_path surrey-nlp/flair-abbr-pubmed-filtered \
    --pred_file ./predictions.tsv \
    --mini_batch_size 8
```
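The same models can also be queried directly from Python. A minimal sketch, assuming the model ID resolves on the Hugging Face Hub (the exact span API varies slightly across Flair versions):

```python
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("surrey-nlp/flair-abbr-pubmed-filtered")

sentence = Sentence("The World Health Organization (WHO) issued new guidance.")
tagger.predict(sentence)

# Print detected abbreviation / long-form spans with their labels.
for span in sentence.get_spans("ner"):
    print(span.text, span.get_label("ner").value)
```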
Zilio, L., Qian, S., Kanojia, D. and Orasan, C., 2024. Utilizing Character-level Models for Efficient Abbreviation and Long-form Detection. Accepted at the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
Zilio, L., Saadany, H., Sharma, P., Kanojia, D. and Orasan, C., 2022. PLOD: An Abbreviation Detection Dataset for Scientific Documents. In Proceedings of the Thirteenth Language Resources and Evaluation Conference.