Codebase for the journal paper "The Rare Word Issue in Natural Language Generation: a Character-Based Solution" (Giovanni Bonetta, Marco Roberti, Rossella Cancelliere, Patrick Gallinari)

marco-roberti/char-dtt-rareword

Character-Based Data-to-Text Generation

This repository contains the source code and the datasets used for the journal paper The Rare Word Issue in Natural Language Generation: a Character-Based Solution by Giovanni Bonetta, Marco Roberti, Rossella Cancelliere, and Patrick Gallinari.

Step-by-step guide

Requirements

Before using the code, install the following packages. The versions used in the experiments are listed in parentheses; the code should also work with more recent versions.

  • Python (3.7.7)
  • tqdm (4.56.0)
  • NumPy (1.19.2)
  • PyTorch (1.7.1)
  • Matplotlib (3.1.3)

Training

The main.py file is used to train the ED+A or ED+ACS model. The only required argument is a random seed:

python3 main.py {seed} -d E2E -m EDA
python3 main.py {seed} -d E2E -m EDACS  # default
python3 main.py {seed} -d E2ENY -m EDA
python3 main.py {seed} -d E2ENY -m EDACS
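
To repeat training over several seeds, the command lines above can be assembled programmatically. The following is a minimal sketch that assumes only the command-line interface shown above; `build_train_command`, `build_sweep`, and the seed values are illustrative names and are not part of the repository.

```python
import itertools

# Choices documented for the -d and -m options.
DATASETS = ("E2E", "E2ENY")
MODELS = ("EDA", "EDACS")


def build_train_command(seed, dataset="E2E", model="EDACS"):
    """Return the argv list for one training run (hypothetical helper)."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    return ["python3", "main.py", str(seed), "-d", dataset, "-m", model]


def build_sweep(seeds):
    """Return one command per (seed, dataset, model) combination."""
    return [
        build_train_command(seed, dataset, model)
        for seed, dataset, model in itertools.product(seeds, DATASETS, MODELS)
    ]


if __name__ == "__main__":
    # Only prints the commands; nothing is launched here.
    for command in build_sweep([1, 2]):
        print(" ".join(command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` from the repository root to actually start a run.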

Different hyperparameters can be set via argparse (run python3 main.py -h for more details):

  --dataset {E2E,E2ENY}  # default: E2E
  --model {EDA,EDACS}  # default: EDACS
  --attention_size 128
  --embedding_size 32
  --hidden_size 300
  --layers 3
  --total_epochs 32
  --batch_size 32
  --learning_rate 0.001
  --clip_norm 5
  --cosine_tmax 50000  # T_max argument for CosineAnnealingLR
  --cosine_etamin 0  # eta_min argument for CosineAnnealingLR
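
To give an intuition for `--cosine_tmax` and `--cosine_etamin`, the sketch below evaluates the closed-form cosine annealing schedule documented for PyTorch's CosineAnnealingLR, using the default values listed above. It is pure Python and does not call PyTorch. Whether main.py advances the scheduler once per batch or once per epoch is not stated here, so `step` simply means "number of scheduler steps taken"; `cosine_annealed_lr` is an illustrative name.

```python
import math


def cosine_annealed_lr(step, base_lr=0.001, t_max=50000, eta_min=0.0):
    """Closed-form cosine annealing, valid for 0 <= step <= t_max.

    base_lr mirrors --learning_rate, t_max mirrors --cosine_tmax and
    eta_min mirrors --cosine_etamin (defaults taken from the list above).
    """
    if not 0 <= step <= t_max:
        raise ValueError("step must lie in [0, t_max]")
    cosine = math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


if __name__ == "__main__":
    for step in (0, 12500, 25000, 37500, 50000):
        print(step, cosine_annealed_lr(step))
```

With the defaults, the learning rate starts at 0.001, reaches half of that value after 25000 steps, and decays to `eta_min` after `t_max` steps.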

At the end of the training phase, one checkpoint per epoch is stored in the trained_nets/{timestamp}/ folder, where timestamp is the UNIX time at which the script was started.
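
Because run folders are named after a UNIX timestamp, the most recent run can be located by comparing folder names numerically. The sketch below assumes that each folder name parses as a number and makes no assumption about checkpoint file names; `latest_run_dir` and `list_checkpoints` are hypothetical helpers, not functions from the repository.

```python
from pathlib import Path


def latest_run_dir(root="trained_nets"):
    """Return the run folder with the largest numeric name, or None."""
    root = Path(root)
    if not root.is_dir():
        return None
    runs = []
    for entry in root.iterdir():
        if not entry.is_dir():
            continue
        try:
            runs.append((float(entry.name), entry))
        except ValueError:
            continue  # folder is not named after a timestamp
    if not runs:
        return None
    return max(runs, key=lambda pair: pair[0])[1]


def list_checkpoints(run_dir):
    """List files in a run folder, skipping generated evaluation files."""
    skipped_suffixes = (".output", ".references")
    return sorted(
        path
        for path in Path(run_dir).iterdir()
        if path.is_file() and not path.name.endswith(skipped_suffixes)
    )


if __name__ == "__main__":
    run_dir = latest_run_dir()
    if run_dir is None:
        print("no training runs found")
    else:
        for checkpoint in list_checkpoints(run_dir):
            print(checkpoint)
```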

Generation

The create_eval_files.py script generates both the output file and the references file, which can be passed directly to the evaluation script. For example, you can generate on the E2E development set using ED+ACS as follows:

PYTHONPATH=. python3 utils/create_eval_files.py {seed} trained_nets/{timestamp}/{checkpoint} dev -d E2E -m EDACS  # default

This will create the trained_nets/{timestamp}/{checkpoint}.dev.output and trained_nets/{timestamp}/{checkpoint}.dev.references files.

You can choose the dataset and the architecture via argparse. Any architecture argument that was changed for training must be set to the same value here (run PYTHONPATH=. python3 utils/create_eval_files.py -h for more details):

--dataset {E2E,E2ENY}  # default: E2E
--model {EDA,EDACS}  # default: EDACS
--attention_size 128
--embedding_size 32
--hidden_size 300
--layers 3
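
Since the architecture arguments must match those used for training, it can help to build the generation command from a single place. The sketch below assumes only the command-line interface and the output naming described above; `build_generation_command`, `expected_eval_files`, and the example checkpoint path are hypothetical, and applying the same naming pattern to splits other than dev is an assumption.

```python
# Architecture defaults as listed in the option summary above.
ARCH_DEFAULTS = {
    "attention_size": 128,
    "embedding_size": 32,
    "hidden_size": 300,
    "layers": 3,
}


def build_generation_command(seed, checkpoint, split="dev",
                             dataset="E2E", model="EDACS", **arch):
    """Return the argv list for utils/create_eval_files.py."""
    unknown = set(arch) - set(ARCH_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown architecture arguments: {sorted(unknown)}")
    command = ["python3", "utils/create_eval_files.py", str(seed),
               str(checkpoint), split, "-d", dataset, "-m", model]
    # Forward only the values that differ from the documented defaults.
    for name, default in ARCH_DEFAULTS.items():
        value = arch.get(name, default)
        if value != default:
            command += [f"--{name}", str(value)]
    return command


def expected_eval_files(checkpoint, split="dev"):
    """Return the (output, references) paths the script is said to create."""
    return (f"{checkpoint}.{split}.output", f"{checkpoint}.{split}.references")


if __name__ == "__main__":
    # Made-up checkpoint path, used for illustration only.
    ckpt = "trained_nets/1600000000/example_checkpoint"
    print(" ".join(build_generation_command(1, ckpt, hidden_size=512)))
    print(expected_eval_files(ckpt))
```

The command still needs `PYTHONPATH=.` in its environment, for example `subprocess.run(command, env={**os.environ, "PYTHONPATH": "."}, check=True)` when launched from the repository root.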

Evaluation

We used the E2E NLG Challenge evaluation metrics to score the generated files. Please refer to their repository for detailed instructions.

Citations

Please use the following BibTeX snippet to cite our work:

@Article{informatics8010020,
    AUTHOR = {Bonetta, Giovanni and Roberti, Marco and Cancelliere, Rossella and Gallinari, Patrick},
    TITLE = {The Rare Word Issue in Natural Language Generation: A Character-Based Solution},
    JOURNAL = {Informatics},
    VOLUME = {8},
    YEAR = {2021},
    NUMBER = {1},
    ARTICLE-NUMBER = {20},
    URL = {https://www.mdpi.com/2227-9709/8/1/20},
    ISSN = {2227-9709},
    DOI = {10.3390/informatics8010020}
}
