This repository contains the code and the PLODv2 dataset for training character-level language models (CLMs) for abbreviation and long-form detection, released with our LREC-COLING 2024 publication (coming soon).
```bash
git clone https://github.com/surrey-nlp/PLODv2-CLM4AbbrDetection.git
conda create -n abbrdet python=3.9
conda activate abbrdet
cd PLODv2-CLM4AbbrDetection
pip install -r requirements.txt
```
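As a quick optional sanity check (assuming requirements.txt installs Flair and PyTorch), the core dependencies should import cleanly:

```bash
python -c "import flair, torch; print(flair.__version__, torch.__version__)"
```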
Use our PLODv2 dataset for training abbreviation detection systems:
```bash
CUDA_VISIBLE_DEVICES=0 python -m src.train_ner \
    --bio_folder ./PLODv2/filtered_data \
    --embed_model '("glove", "news-forward", "news-backward")' \
    --save_folder ./stacked_glove_news_filtered \
    --learning_rate 0.01 \
    --mini_batch_size 32 \
    --max_epochs 150 \
    --use_transformer False
```
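For orientation, the command above corresponds roughly to the following Flair training loop. This is a minimal sketch rather than the exact src.train_ner implementation; the BIO column format and the tagger's hidden size are assumptions:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Assumed layout: train/dev/test files with the token in column 0
# and the BIO tag in column 1.
corpus = ColumnCorpus("./PLODv2/filtered_data", {0: "text", 1: "ner"})
label_dict = corpus.make_label_dictionary(label_type="ner")

# Stack GloVe word embeddings with forward/backward character-level LMs,
# mirroring --embed_model '("glove", "news-forward", "news-backward")'.
embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(
    hidden_size=256,  # assumed; not exposed by the CLI flags above
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "./stacked_glove_news_filtered",
    learning_rate=0.01,
    mini_batch_size=32,
    max_epochs=150,
)
```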
Train character-level language models from scratch or for continued pre-training:
```bash
CUDA_VISIBLE_DEVICES=0 python -m src.flair_clm \
    --continue_pretrain False \
    --corpus_path ./corpus \
    --is_forward True \
    --out plm \
    --learning_rate 20.0 \
    --mini_batch_size 100 \
    --max_epochs 300 \
    --sequence_length 256 \
    --hidden_size 2048
```
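Under the hood this relies on Flair's language-model trainer. The sketch below shows the from-scratch case (--continue_pretrain False); continued pre-training would instead load an existing LanguageModel and resume training on the new corpus. The corpus layout and the single LSTM layer are assumptions:

```python
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

is_forward_lm = True  # --is_forward True

# Default character dictionary shipped with Flair; a custom dictionary
# could instead be built from the training corpus.
dictionary = Dictionary.load("chars")

# Assumed layout: ./corpus/train/ with split files, plus valid.txt and test.txt.
corpus = TextCorpus("./corpus", dictionary, is_forward_lm, character_level=True)

language_model = LanguageModel(
    dictionary,
    is_forward_lm,
    hidden_size=2048,
    nlayers=1,  # assumed; one LSTM layer is Flair's usual choice for CLMs
)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train(
    "plm",
    sequence_length=256,
    mini_batch_size=100,
    max_epochs=300,
    learning_rate=20.0,
)
```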
Our fine-tuned models for abbreviation and long-form detection are listed in the table below.
In the table, the No. 1 series of models is fine-tuned on PLODv2 using RoBERTa-large; the No. 2 series is fine-tuned on PLODv2 using stacked character-level PubMed embeddings; and the No. 3 series is fine-tuned on PLODv2 by stacking RoBERTa-large with character-level language models initialized from the PubMed models and further pre-trained on PLOS.
To run (or fine-tune) Transformer models such as RoBERTa-large, see our Jupyter notebooks. To run inference with our fine-tuned stacked-embedding models via Flair:
```bash
CUDA_VISIBLE_DEVICES=0 python -m src.predict \
    --bio_folder ./PLODv2/filtered_data \
    --model_path surrey-nlp/flair-abbr-pubmed-filtered \
    --pred_file ./predictions.tsv \
    --mini_batch_size 8
```
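The same models can also be queried directly from Python. A minimal sketch, assuming the model ID resolves on the Hugging Face Hub (the exact span API varies slightly across Flair versions):

```python
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("surrey-nlp/flair-abbr-pubmed-filtered")

sentence = Sentence("The World Health Organization (WHO) issued new guidance.")
tagger.predict(sentence)

# Print detected abbreviation / long-form spans with their labels.
for span in sentence.get_spans("ner"):
    print(span.text, span.get_label("ner").value)
```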
Zilio, L., Qian, S., Kanojia, D. and Orasan, C., 2024. Utilizing Character-level Models for Efficient Abbreviation and Long-form Detection. Accepted at the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
Zilio, L., Saadany, H., Sharma, P., Kanojia, D. and Orasan, C., 2022. PLOD: An Abbreviation Detection Dataset for Scientific Documents. In Proceedings of the Thirteenth Language Resources and Evaluation Conference.