Character-Based Data-to-Text Generation

This repository contains the source code and the datasets used for the journal paper "The Rare Word Issue in Natural Language Generation: A Character-Based Solution" by Giovanni Bonetta, Marco Roberti, Rossella Cancelliere, and Patrick Gallinari.

Step-by-step guide

Requirements

Before using the code, install the following packages. The versions used in the experiments are listed; the code should also work with more recent versions.

Python (3.7.7)
tqdm (4.56.0)
NumPy (1.19.2)
PyTorch (1.7.1)
Matplotlib (3.1.3)

Training

The main.py file is used to train the ED+A or ED+ACS model. The only required argument is a random seed:

python3 main.py {seed} -d E2E -m EDA
python3 main.py {seed} -d E2E -m EDACS  # default
python3 main.py {seed} -d E2ENY -m EDA
python3 main.py {seed} -d E2ENY -m EDACS
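
To repeat training over several seeds, the commands above can be assembled programmatically. The following is a minimal sketch: the helper name `build_train_command` and the seed values are illustrative assumptions, not part of this repository.

```python
def build_train_command(seed, dataset="E2E", model="EDACS"):
    """Return the argv list for one training run (hypothetical helper)."""
    if dataset not in {"E2E", "E2ENY"}:
        raise ValueError(f"unknown dataset: {dataset}")
    if model not in {"EDA", "EDACS"}:
        raise ValueError(f"unknown model: {model}")
    return ["python3", "main.py", str(seed), "-d", dataset, "-m", model]


if __name__ == "__main__":
    # The seeds below are arbitrary examples; commands are printed, not run.
    for seed in (1, 2, 3):
        print(" ".join(build_train_command(seed, dataset="E2ENY", model="EDA")))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to start the corresponding run.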

Different hyperparameters can be set via argparse (run python3 main.py -h for more details):

  --dataset {E2E,E2ENY}  # default: E2E
  --model {EDA,EDACS}  # default: EDACS
  --attention_size 128
  --embedding_size 32
  --hidden_size 300
  --layers 3
  --total_epochs 32
  --batch_size 32
  --learning_rate 0.001
  --clip_norm 5
  --cosine_tmax 50000  # T_max argument for CosineAnnealingLR
  --cosine_etamin 0  # eta_min argument for CosineAnnealingLR

At the end of the training phase, one checkpoint per epoch will be stored in the trained_nets/{timestamp}/ folder, where timestamp is the UNIX time at which the script was started.
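
Because run folders are named after a UNIX time, the most recent run can be located by comparing folder names numerically. This is a minimal sketch: `latest_run_dir` is a hypothetical helper, and it assumes the folder name parses as a number (integer or float). Checkpoint file names are not specified here, so the sketch simply lists whatever files the folder contains.

```python
from pathlib import Path


def _as_time(name):
    """Parse a folder name as a UNIX time, or return None if it is not numeric."""
    try:
        return float(name)
    except ValueError:
        return None


def latest_run_dir(root="trained_nets"):
    """Return the run folder with the largest numeric name, or None if there is none."""
    root = Path(root)
    if not root.is_dir():
        return None
    runs = []
    for path in root.iterdir():
        stamp = _as_time(path.name)
        if path.is_dir() and stamp is not None:
            runs.append((stamp, path))
    if not runs:
        return None
    return max(runs, key=lambda item: item[0])[1]


if __name__ == "__main__":
    run = latest_run_dir()
    if run is not None:
        # List the files stored for the most recent run.
        for checkpoint in sorted(p for p in run.iterdir() if p.is_file()):
            print(checkpoint)
```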

Generation

The create_eval_files.py script generates both the output and the reference files, which can be used directly as inputs for the evaluation script. For example, you can generate on the E2E development set using ED+ACS as follows:

PYTHONPATH=. python3 utils/create_eval_files.py {seed} trained_nets/{timestamp}/{checkpoint} dev -d E2E -m EDACS  # default

This will create the trained_nets/{timestamp}/{checkpoint}.dev.output and trained_nets/{timestamp}/{checkpoint}.dev.references files.
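
The naming rule above can be captured in a small helper, for instance to check that both files exist before running the evaluation. This is a sketch only: `eval_file_paths` is a hypothetical name, and applying the same pattern to subsets other than `dev` is an assumption.

```python
def eval_file_paths(checkpoint_path, subset="dev"):
    """Return the (output, references) file paths for a checkpoint (hypothetical helper).

    The pattern {checkpoint}.{subset}.output / .references is documented for
    subset="dev"; other subset names are assumed to follow the same rule.
    """
    base = f"{checkpoint_path}.{subset}"
    return f"{base}.output", f"{base}.references"


if __name__ == "__main__":
    # Placeholders are kept literal here; substitute a real checkpoint path.
    output, references = eval_file_paths("trained_nets/{timestamp}/{checkpoint}")
    print(output)
    print(references)
```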

You can choose the dataset and the architecture via argparse. Any architecture arguments that were changed from their defaults during training must be passed again with the same values (run PYTHONPATH=. python3 utils/create_eval_files.py -h for more details):

--dataset {E2E,E2ENY}  # default: E2E
--model {EDA,EDACS}  # default: EDACS
--attention_size 128
--embedding_size 32
--hidden_size 300
--layers 3
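
One way to keep generation consistent with training is to build the generation command from the same architecture values. The sketch below is an assumption-laden illustration: `build_generation_command` and `ARCH_FLAGS` are hypothetical names, and only the flags listed above are used.

```python
ARCH_FLAGS = ("attention_size", "embedding_size", "hidden_size", "layers")


def build_generation_command(seed, checkpoint, subset="dev",
                             dataset="E2E", model="EDACS", **arch):
    """Return the argv list for create_eval_files.py (hypothetical helper).

    Only architecture values passed in `arch` are forwarded; the rest keep
    the script's defaults.
    """
    unknown = set(arch) - set(ARCH_FLAGS)
    if unknown:
        raise ValueError(f"unknown architecture arguments: {sorted(unknown)}")
    cmd = ["python3", "utils/create_eval_files.py", str(seed), checkpoint,
           subset, "-d", dataset, "-m", model]
    for name in ARCH_FLAGS:
        if name in arch:
            cmd += [f"--{name}", str(arch[name])]
    return cmd


if __name__ == "__main__":
    # Example: a model trained with a non-default hidden size (value is illustrative).
    print(" ".join(build_generation_command(
        1, "trained_nets/{timestamp}/{checkpoint}", hidden_size=512)))
```

When running the command with `subprocess.run`, pass `env={**os.environ, "PYTHONPATH": "."}` to reproduce the `PYTHONPATH=.` prefix used above.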

Evaluation

We took advantage of the E2E NLG Challenge Evaluation metrics. Please refer to their repository for detailed instructions.

Citations

Please use the following BibTeX snippet to cite our work:

@Article{informatics8010020,
    AUTHOR = {Bonetta, Giovanni and Roberti, Marco and Cancelliere, Rossella and Gallinari, Patrick},
    TITLE = {The Rare Word Issue in Natural Language Generation: A Character-Based Solution},
    JOURNAL = {Informatics},
    VOLUME = {8},
    YEAR = {2021},
    NUMBER = {1},
    ARTICLE-NUMBER = {20},
    URL = {https://www.mdpi.com/2227-9709/8/1/20},
    ISSN = {2227-9709},
    DOI = {10.3390/informatics8010020}
}
