

BanglaNLG

This repository contains the official release of the model "BanglaT5", together with the downstream finetuning code and datasets introduced in the paper "BanglaNLG and BanglaT5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla", accepted at the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023).

Updates

  • We have released BanglaT5 (small). It can be fine-tuned with as little as 4 GB VRAM!

Models

The BanglaT5 model checkpoint is available on the Hugging Face model hub.

To use this model for the supported downstream tasks in this repository, see Training & Evaluation.

We also release the following finetuned checkpoints:

| Model Name | Task Name |
|---|---|
| banglat5_nmt_bn_en | Bengali-English MT |
| banglat5_nmt_en_bn | English-Bengali MT |

Note: This model was pretrained using a specific normalization pipeline, available in the csebuetnlp/normalizer repository. All finetuning scripts in this repository use this normalization by default. If you need to adapt the pretrained model for a different task, make sure the text units are normalized with this pipeline before tokenizing to get the best results. A basic example is available on the model page.
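The order of operations described in the note (normalize first, then tokenize, then generate) can be sketched as follows. This is a minimal illustration, not the authors' example from the model page. `fallback_normalize` is a hypothetical stand-in (Unicode NFKC plus whitespace collapsing) and is NOT the official normalization pipeline; pass the real normalizer as `normalize_fn` when it is installed. The `model_name` argument is deliberately left to the caller: take the hub identifier from the model page.

```python
import unicodedata
from typing import Callable


def fallback_normalize(text: str) -> str:
    """Hypothetical stand-in, NOT the authors' pipeline: NFKC + whitespace collapse."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def generate(text: str,
             model_name: str,
             normalize_fn: Callable[[str], str] = fallback_normalize,
             max_length: int = 128) -> str:
    """Normalize the input, tokenize it, and decode one generated sequence."""
    # Imported lazily so the normalization helper works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Normalization happens BEFORE tokenization, as the note above requires.
    input_ids = tokenizer(normalize_fn(text), return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Only the normalization step runs here; no model is downloaded.
cleaned = fallback_normalize("  আমি   ভাত \n খাই ")
print(cleaned)
```

To run generation, call `generate(text, model_name=...)` with the hub identifier of one of the finetuned checkpoints and, ideally, the official normalizer passed as `normalize_fn`.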

Datasets

The benchmarking datasets are as follows:

Please see the BanglaBERT repository to access the pretraining corpus.

Setup

To install the necessary requirements, use the following snippet:

$ git clone https://github.com/csebuetnlp/BanglaNLG
$ cd BanglaNLG/
$ conda create python==3.7.9 pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch -p ./env
$ conda activate ./env # or source activate ./env (for older versions of anaconda)
$ bash setup.sh 
  • Use the newly created environment for running the scripts in this repository.

Training & Evaluation

While all tasks we consider are modeled as seq2seq tasks, some tasks need specific data preprocessing for preparing the input and output sequences. See below for task-specific finetuning/inference scripts:

  • Sequence To Sequence
    • For general sequence-to-sequence tasks such as:
      • Machine Translation
      • Text Summarization
      • News Headline Generation, etc.
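Because every task is modeled as seq2seq, each training example reduces to an input string and an output string. The sketch below writes such pairs to a JSON Lines file. The field names `source` and `target`, the file name, and the two example pairs are assumptions made for illustration; check the task-specific scripts for the keys and file layout they actually expect.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Tuple


def write_seq2seq_jsonl(pairs: Iterable[Tuple[str, str]],
                        path: Path,
                        source_key: str = "source",   # hypothetical field name
                        target_key: str = "target") -> int:  # hypothetical field name
    """Write (input, output) pairs as one JSON object per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for source, target in pairs:
            record = {source_key: source.strip(), target_key: target.strip()}
            # ensure_ascii=False keeps Bangla text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Toy translation pairs, made up for this illustration.
demo_pairs = [
    ("আমি ভাত খাই", "I eat rice"),
    ("সে বই পড়ে", "He reads books"),
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "train.jsonl"
    n_written = write_seq2seq_jsonl(demo_pairs, out_path)
    lines = out_path.read_text(encoding="utf-8").splitlines()
    first_record = json.loads(lines[0])

print(n_written, first_record)
```

The same helper covers summarization or headline generation by swapping what goes into the input and output strings; tasks that need extra preprocessing would build the input string differently before calling it.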

Benchmarks

  • Supervised fine-tuning
| Model | Params | MT (SacreBLEU) | TS (ROUGE-2) | QA (EM/F1) | MTD (SacreBLEU-1) | NHG (ROUGE-2) | XLS (ROUGE-2) |
|---|---|---|---|---|---|---|---|
| mT5 (base) | 582M | 30.1/17.2 | 10.3 | 59.0/65.3 | 17.5 | 9.6 | 2.7/0.7 |
| XLM-ProphetNet | 616M | 27.5/15.4 | 7.8 | 53.0/57.3 | 20.0 | 9.5 | 6.2/2.7 |
| mBART-50 | 611M | 29.7/15.5 | 10.4 | 53.4/58.9 | 18.5 | 11.2 | 5.4/3.7 |
| IndicBART (unified) | 244M | 28.1/16.6 | 8.9 | 59.6/65.6 | 14.8 | 7.9 | 6.3/2.5 |
| IndicBART (separate) | 244M | 27.5/15.7 | 9.2 | 55.3/61.2 | 14.1 | 9.1 | 5.3/2.4 |
| BanglaT5 | 247M | 31.3/17.4 | 13.7 | 68.5/74.8 | 19.0 | 13.8 | 6.4/4.0 |
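To read the table in terms of gains, the sketch below computes BanglaT5's absolute and relative improvement over the strongest baseline for a few columns. The scores are copied from the table above; the helper name `gain_over_best_baseline` and the choice of columns are illustrative assumptions, not part of the repository.

```python
from typing import Dict, Tuple

# Scores copied from the benchmark table (a subset of columns).
SCORES: Dict[str, Dict[str, float]] = {
    "TS (ROUGE-2)": {
        "mT5 (base)": 10.3, "XLM-ProphetNet": 7.8, "mBART-50": 10.4,
        "IndicBART (unified)": 8.9, "IndicBART (separate)": 9.2, "BanglaT5": 13.7,
    },
    "QA (EM)": {
        "mT5 (base)": 59.0, "XLM-ProphetNet": 53.0, "mBART-50": 53.4,
        "IndicBART (unified)": 59.6, "IndicBART (separate)": 55.3, "BanglaT5": 68.5,
    },
    "NHG (ROUGE-2)": {
        "mT5 (base)": 9.6, "XLM-ProphetNet": 9.5, "mBART-50": 11.2,
        "IndicBART (unified)": 7.9, "IndicBART (separate)": 9.1, "BanglaT5": 13.8,
    },
}


def gain_over_best_baseline(scores: Dict[str, float],
                            target: str = "BanglaT5") -> Tuple[str, float, float]:
    """Return (best baseline name, absolute gain, relative gain in percent)."""
    baselines = {name: s for name, s in scores.items() if name != target}
    best_name = max(baselines, key=baselines.get)
    best = baselines[best_name]
    absolute = scores[target] - best
    relative = 100.0 * absolute / best
    return best_name, absolute, relative


GAINS = {metric: gain_over_best_baseline(s) for metric, s in SCORES.items()}

for metric, (name, absolute, relative) in GAINS.items():
    print(f"{metric}: +{absolute:.1f} absolute, +{relative:.1f}% relative over {name}")
```

On these columns, the largest relative gain is on text summarization (about 3.3 ROUGE-2 points over mBART-50, roughly 31.7% relative) and the largest absolute gain is on QA exact match (about 8.9 points over IndicBART (unified)).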

License

Contents of this repository are restricted to non-commercial research purposes only under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Citation

If you use any of the datasets, models or code modules, please cite the following paper:

@inproceedings{bhattacharjee-etal-2023-banglanlg,
    title = "{B}angla{NLG} and {B}angla{T}5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in {B}angla",
    author = "Bhattacharjee, Abhik  and
      Hasan, Tahmid  and
      Ahmad, Wasi Uddin  and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.54",
    pages = "726--735",
    abstract = "This work presents {`}BanglaNLG,{'} a comprehensive benchmark for evaluating natural language generation (NLG) models in Bangla, a widely spoken yet low-resource language. We aggregate six challenging conditional text generation tasks under the BanglaNLG benchmark, introducing a new dataset on dialogue generation in the process. Furthermore, using a clean corpus of 27.5 GB of Bangla data, we pretrain {`}BanglaT5{'}, a sequence-to-sequence Transformer language model for Bangla. BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming several multilingual models by up to 9{\%} absolute gain and 32{\%} relative gain. We are making the new dialogue dataset and the BanglaT5 model publicly available at https://github.com/csebuetnlp/BanglaNLG in the hope of advancing future research on Bangla NLG.",
}
