This is the codebase for the EMNLP 2020 submission *Evaluating Tough Mentions to Better Understand NER Performance*.
All scripts should be run from the directory where they are located; every script mentioned below can be found in `nereval/`. A `demo.sh` script is provided to demonstrate the whole evaluation process.
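For example, assuming the `data/` directory described below is already populated and `demo.sh` takes no arguments (check the script itself before running):

```bash
cd nereval
bash demo.sh
```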
You need one or more NER datasets and a corresponding prediction file for each test set. All files must be in CoNLL format, i.e. set up your directory like this:
```
├── data
│   ├── conll-english          # CoNLL 2003 English dataset
│   │   ├── train.txt
│   │   ├── valid.txt
│   │   ├── test.txt
│   │   └── pred.txt           # prediction file on test.txt
```
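Each file uses the CoNLL 2003 column layout: one token per line, whitespace-separated columns with the BIO entity tag last, and a blank line between sentences. A short gold-standard sample is shown below (for `pred.txt`, the assumption here is that the final column simply holds the predicted tag instead):

```
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
```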
The unseen and type-confusable mention (TCM) subsets need to be generated before running the scorers (example invocations are sketched after this list):
- Run `ingest_conll.py` to get pickled `train.txt` and `test.txt`.
- Run `extract_tcm.py` and `extract_oov.py` to pickle the unseen and type-confusable mention subsets.
- Run `score_oov.py` to get per-type performance on the different unseen mention subsets.
- Run `score_tcm.py` to get per-type performance on the different type-confusable mention subsets.
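Putting the steps together, a typical run looks like the sketch below. The bare invocations are assumptions: the exact command-line arguments are defined in each script, so check them (or `demo.sh`) before running.

```bash
cd nereval

# Step 1: ingest the raw CoNLL files into pickles
python ingest_conll.py

# Step 2: build the unseen and type-confusable mention subsets
python extract_oov.py
python extract_tcm.py

# Step 3: score per-type performance on each subset
python score_oov.py
python score_tcm.py
```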