
Please see NATv2, an improved version of the original NAT framework.


Noise-Aware Training (NAT)

This repository contains the code and data that can be used to reproduce the experiments described in the paper "NAT: Noise-Aware Training for Robust Neural Sequence Labeling", which was presented at the ACL 2020 conference.

Background

Standard sequence labeling systems are usually trained on clean text and exhibit substantially lower accuracy when applied to imperfect textual input. NAT aims to improve the robustness of sequence labeling performed on data from noisy sources, such as Optical Character Recognition (OCR), Automatic Speech Recognition (ASR), or misspelled user-generated text. NAT uses a standard sequence labeling architecture but modifies the training objective of the neural model. To this end, it employs two auxiliary training objectives (see the sketch after the list below):

  1. Data augmentation objective, which directly induces noise in the input data and trains the model on a mixture of clean and noisy samples.
  2. Stability training objective, which encourages similarity between the original and the perturbed input and thereby helps the model produce a noise-invariant latent representation.
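
The following is a minimal PyTorch-style sketch of how these two objectives could be combined with the standard task loss. The helper names forward_loss and encode, as well as the use of the mean squared error as the similarity measure, are illustrative assumptions rather than the repository's actual API.

```python
import torch.nn.functional as F

def nat_loss(model, clean_batch, noisy_batch, labels, alpha=1.0,
             train_mode="augmentation"):
    """Illustrative NAT-style combined loss (a sketch, not the repository's code)."""
    # standard task loss on the clean input
    loss = model.forward_loss(clean_batch, labels)
    if train_mode == "augmentation":
        # data augmentation objective: additionally train on the perturbed samples
        loss = loss + alpha * model.forward_loss(noisy_batch, labels)
    elif train_mode == "stability":
        # stability objective: pull the clean and noisy latent
        # representations together (MSE is one possible similarity measure)
        z_clean = model.encode(clean_batch)
        z_noisy = model.encode(noisy_batch)
        loss = loss + alpha * F.mse_loss(z_noisy, z_clean)
    return loss
```

In both cases, alpha corresponds to the --alpha command-line parameter described in the Using the code section.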

NAT was implemented as an extension to the FLAIR library, but it can be integrated into any other sequence labeling framework.

Project Structure

├── flair
├── flair_ext
│   ├── models
│   ├── trainers
│   └── visual
├── resources
│   ├── cmx
│   ├── taggers
│   ├── tasks
│   └── typos
└── robust_ner

The flair directory includes the basic FLAIR framework. See the Quick Start section for information on how to obtain it.

The flair_ext directory contains extensions to the basic FLAIR library: the models, trainers, and visual modules.

The robust_ner directory comprises the modules used for noise induction and spelling correction, as well as the helper functions and classes.

The resources directory includes the data files. The confusion matrices are included in the project. See the Quick Start notes for information on how to obtain the typos files and the data sets.

Quick Start

Getting the Code

  1. Clone or download the NAT GitHub repository:
git clone https://github.com/mnamysl/nat-acl2020
  2. Extract "flair.tgz" to the NAT directory:
tar -xzvf flair.tgz

Getting the data

Named Entity Recognition (NER) Data Sets

Please follow the instructions on the websites of the corresponding shared tasks (CoNLL 2003 and GermEval 2014, matching the conll03_en, conll03_de, and germeval corpora).

Misspellings, typos

  1. Download the typos lists (e.g., en.natural or moe_misspellings_train.tsv) from the websites of the corresponding projects.
  2. Move the downloaded files to the resources/typos sub-directory (a sketch of how such lists can be applied follows below).
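
For illustration, lookup-based misspelling induction could look roughly as follows. The assumed file format (a correct word followed by its misspelled variants on each line) is a simplification; the actual parsing in the robust_ner modules may differ.

```python
import random

def load_typos(path):
    """Parse a typos list; assumes 'correct misspelling1 misspelling2 ...' per line."""
    typos = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if len(tokens) >= 2:
                typos[tokens[0]] = tokens[1:]
    return typos

def induce_typos(words, typos, misspelling_rate=0.1):
    """Replace a word with one of its known misspellings at the given rate."""
    return [random.choice(typos[w])
            if w in typos and random.random() < misspelling_rate else w
            for w in words]
```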

Installing the Prerequisites

  1. Please install all required packages as shown below:
pip install -r requirements.txt
  2. To use the ELMo embeddings, you need to install the AllenNLP library:
pip install allennlp
  3. If you plan to use the spell-checking functionality, you need to install the packages required to run hunspell (a usage sketch follows below):
sudo apt-get install hunspell hunspell-de-de hunspell-en-us libhunspell-dev python-dev
pip install hunspell
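
Once installed, the pyhunspell package can be used for dictionary-based correction roughly as follows; the dictionary paths are typical for the Ubuntu packages above but may differ on your system.

```python
import hunspell

# dictionary paths as installed by the hunspell-en-us Ubuntu package
checker = hunspell.HunSpell('/usr/share/hunspell/en_US.dic',
                            '/usr/share/hunspell/en_US.aff')

def correct_token(token):
    """Return the top hunspell suggestion for a misspelled token."""
    if checker.spell(token):
        return token  # already a valid word
    suggestions = checker.suggest(token)
    return suggestions[0] if suggestions else token

print(correct_token('exmaple'))  # typically prints 'example'
```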

Using the code

You can use the NAT functionality by calling the main.py Python script. The following command-line parameters can be specified (in order of importance):

| Parameter | Description | Value |
| --- | --- | --- |
| --mode | Execution mode | One of: train, tune, eval. |
| --corpus | Data set to use | One of: conll03_en (default), conll03_de, germeval. |
| --model | Model name | Arbitrary string. |
| --train_mode | Training mode | One of: standard (default), augmentation, stability. |
| --alpha | Auxiliary loss weight | Floating-point number (default: 1.0). |
| --misspelling_rate | Noise level | Floating-point number (default: 0.0). |
| --type | Type of embeddings | One of: flair (default), bert, elmo, word+char. |
| --cmx_file | Confusion matrix file | E.g., tesseract3-RS. |
| --typos_file | Typos file | E.g., en.natural or moe_misspellings_train.tsv. |
| --spell_check | Use spell checking | No parameters; turned off by default. |
| --lr | Initial learning rate | Floating-point number (default: 0.1). |
| --train_with_dev | Train with the dev. set | No parameters; turned off by default. |
| --col_idx | Index of the tag column | Integer value (default: 3). |
| --text_idx | Index of the text column | Integer value (default: 0). |
| --device | Device to use | Torch device type string (default: cuda). |
| --downsample | Downsample rate | Floating-point value (default: 1.0). |
| --num_hidden | Tagger hidden state size | Integer value (default: 256). |
| --max_epochs | Max. training epochs | Integer value (default: 100). |
| --batch_size | Mini-batch size | Integer value (default: 32). |
| --checkpoint | Checkpoint file name | String (default: best-model.pt). |
| --no_valid_misspell | No validation with misspellings | No parameters; turned off by default. |
| --verbose | Print verbose messages | No parameters; turned off by default. |
| -h | Print help | No parameters. |

Training a model from scratch

The following call will start the training of a new model called my_new_model on the English CoNLL 2003 data set, using the data augmentation objective with a weight factor of 1.0 and a noise level of 10%:

python3 main.py --mode train --corpus conll03_en --model my_new_model --train_mode augmentation --misspelling_rate 0.1 --alpha 1.0

All your models will be stored in the resources/taggers directory.
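
Since NAT extends FLAIR, a stored model can typically be loaded and applied with the standard FLAIR API, as in the sketch below (depending on the FLAIR version, loading a checkpoint from a file may instead require SequenceTagger.load_from_file):

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load the checkpoint produced by the training run above
tagger = SequenceTagger.load('resources/taggers/my_new_model/best-model.pt')

sentence = Sentence('George Washington went to Washington.')
tagger.predict(sentence)
print(sentence.to_tagged_string())
```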

Fine-tuning an existing model

The following call will start the fine-tuning of a previously trained model, using the stability objective with different parameters and a lower learning rate:

python3 main.py --mode tune --corpus conll03_en --model my_trained_model --train_mode stability --misspelling_rate 0.05 --alpha 0.5 --lr 0.01

Evaluation

Finally, the prepared model can be evaluated on real OCR errors by using the following call:

python3 main.py --mode eval --corpus conll03_en --model my_trained_model --cmx_file tesseract3-RS
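
Conceptually, evaluation with a confusion matrix amounts to sampling OCR-style character replacements from the matrix. The sketch below illustrates the idea under a simplified format, where the matrix maps a character to (replacement, probability) pairs; the actual cmx files and the sampling code in robust_ner are more involved (real OCR noise also includes insertions and deletions, omitted here).

```python
import random

def perturb_with_cmx(text, cmx):
    """Replace characters according to a (simplified) confusion matrix.

    cmx maps a character to a list of (replacement, probability) pairs,
    e.g. {'l': [('1', 0.02), ('l', 0.98)]}.
    """
    out = []
    for ch in text:
        r = random.random()
        cumulative = 0.0
        replacement = ch  # fall back to the original character
        for candidate, prob in cmx.get(ch, []):
            cumulative += prob
            if r < cumulative:
                replacement = candidate
                break
        out.append(replacement)
    return ''.join(out)

print(perturb_with_cmx("label", {'l': [('1', 0.5), ('l', 0.5)]}))
```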

Citing NAT

Please cite our paper when using the code:

@inproceedings{namysl-etal-2020-nat,
    title = "{NAT}: Noise-Aware Training for Robust Neural Sequence Labeling",
    author = {Namysl, Marcin and Behnke, Sven and K{\"o}hler, Joachim},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.138",
    pages = "1501--1517",
}

Authors

Marcin Namysl, Sven Behnke, and Joachim Köhler (see the citation above).

License

This project is licensed under the MIT License. See the LICENSE.md file for details.