
Repository for Systematically Comparing Multilingual NER Tools

This repository contains the replication materials for the article "Automatically Finding Actors in Texts: A Performance Review of Multilingual Named Entity Recognition Tools," published in Communication Methods and Measures (2024), doi: https://doi.org/10.1080/19312458.2024.2324789. For the exact version used in the paper, please go to the branch with the corresponding DOI: https://github.com/mrwunderbar666/ner_tool_comparison/tree/10.1080/19312458.2024.2324789

The repository's main branch is intended to be updated with new corpora and NER tools. Contributions are welcome! You can either create an issue with your suggestion, or make a pull request.

Installation of Requirements

Make sure to install all required packages (Python & R) before proceeding.

  1. Create a virtual environment
    • python3 -m venv .venv
    • source .venv/bin/activate
  2. Execute the script install_prerequisites.sh

Manual Installation

Install Python Dependencies

python -m pip install -r requirements.txt

Additionally, get a script from huggingface:

curl https://huggingface.co/datasets/conll2003/raw/main/conll2003.py -o utils/conll2003.py

Then, get the spaCy models:

python -m spacy download zh_core_web_lg
python -m spacy download zh_core_web_trf
python -m spacy download nl_core_news_lg
python -m spacy download en_core_web_lg
python -m spacy download fr_core_news_lg
python -m spacy download de_core_news_lg
python -m spacy download es_core_news_lg
python -m spacy download xx_ent_wiki_sm
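
Once downloaded, a model can be loaded and applied as in the following minimal sketch (for illustration only; it is not part of the evaluation pipeline):

import spacy

# Load one of the downloaded models and extract named entities.
nlp = spacy.load("en_core_web_lg")
doc = nlp("Angela Merkel visited Paris in 2019.")
for ent in doc.ents:
    print(ent.text, ent.label_)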

Install R Packages

Rscript r_packages.r

Install Tools

python3 tools/corenlp/get_corenlp.py
Rscript tools/icews/get_icews.r
python3 tools/jrcnames/get_jrc.py
python3 tools/nltk/get_dependencies.py
python3 tools/opennlp/get_opennlp.py

Data

The datasets for evaluation are the following:

  • CoNLL 2002 (Dutch & Spanish)
  • CoNLL 2003 (English & German*)
  • Europeana (German, French, Dutch)
  • GermEval2014 (German)
  • WNUT Emerging Entities (English)
  • OntoNotes* (English, Chinese, Arabic)
  • WikiANN* (many)
  • CNEC 2.0 (Czech)

Almost every dataset can be downloaded automatically with the supplied scripts. The datasets marked with an asterisk (*) require user intervention. Please refer to the readme.md files in the corresponding sub-directories for instructions.

Please be aware that some datasets are very large and can take a while to download and convert.

Data Conversion Scripts

Collection of scripts that automatically retrieve the datasets (if possible) and then convert them to a common format.

Every script should be run from the root directory. For example, to automatically get the CoNLL 2002 dataset, run:

python corpora/conll/get_conll2002.py

When you run the scripts that automatically download and convert the corpora, a registry.csv is created that contains meta-information on each corpus. This file is used by the evaluation scripts to automatically find all available datasets and run the tests.
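As an illustration, the registry can be inspected with pandas. This is only a sketch: it assumes registry.csv is written to the repository root, and the exact columns are defined by the conversion scripts.

import pandas as pd

# Load the registry written by the corpus conversion scripts and
# print whatever meta-information it contains.
registry = pd.read_csv("registry.csv")
print(registry.columns.tolist())
print(registry.head())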

Each corpus is stored in tokenized long format (one row = one token) and contains the following columns (see the loading sketch after this list):

  • dataset: name of dataset
  • language: language of dataset / tokens
  • subset: Original name of subset (or split) of dataset. E.g., training, validation, etc.
  • sentence_id: id of the sentence (string), typically enumerated from 000001. If the corpus also has document ids, the sentence_id includes the doc_id as well, e.g., 0001_000001.
  • token_id: id (actually position) of token within the sentence. Always starts at 1.
  • token: actual token in its original form.
  • CoNLL_IOB2: named entity tag according to the Inside-Outside-Beginning scheme as defined by CoNLL. Named entities are limited to Persons, Organizations, Locations, and Misc.
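
A minimal sketch of working with this long format, assuming a converted corpus has been written as a CSV file (the path below is hypothetical; the actual file locations and formats are set by the conversion scripts and recorded in registry.csv):

import pandas as pd

# Hypothetical path; look up the real location in registry.csv.
corpus = pd.read_csv("corpora/conll/conll2002_nl.csv")
corpus["token"] = corpus["token"].astype(str)

# Reconstruct sentences from the one-row-per-token layout.
sentences = (corpus.sort_values(["sentence_id", "token_id"])
                   .groupby("sentence_id")["token"]
                   .apply(" ".join))
print(sentences.head())

# Distribution of entity tags in the IOB2 column.
print(corpus["CoNLL_IOB2"].value_counts())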

NER Tools

  • CoreNLP
  • NLTK
  • ICEWS
  • JRC Names
  • Nametagger
  • OpenNLP
  • spaCy
  • XLM-RoBERTa (via Huggingface)
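
As an illustration of the transformer-based tool, a multilingual XLM-RoBERTa NER model can be run via the Huggingface pipeline API. The model name below is only an example; the scripts in tools/ define the exact configuration used in the paper.

from transformers import pipeline

# Example multilingual NER model; the repository's own scripts
# specify the model actually used in the evaluation.
ner = pipeline("ner",
               model="Davlan/xlm-roberta-base-ner-hrl",
               aggregation_strategy="simple")

print(ner("Angela Merkel met Emmanuel Macron in Berlin."))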

Automatically Getting & Installing Tools

Every script should be run from the root directory. For example, to automatically get CoreNLP, run:

python tools/corenlp/get_corenlp.py

Other Tools

More Corpora

English

French

The license for the Quaero corpus prohibits training a model on the data and redistributing the resulting model. Hence, the corpus is used for validation purposes only.

Polish

  • NJKP: the manually annotated 1-million-word subcorpus of the NJKP, available under CC-BY 4.0.

Russian

Hungarian

Japanese

Italian

Collections of more corpora (other domains)

Difficult Examples

See the file challenges.json for a set of sentences which pose challenges for NER tools.
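
For example, the file can be inspected with a few lines of Python (a sketch; the exact structure of challenges.json may differ, so this only dumps the raw contents):

import json

# Load and print the challenge sentences; adjust once the
# actual structure of challenges.json is known.
with open("challenges.json", encoding="utf-8") as f:
    challenges = json.load(f)
print(json.dumps(challenges, indent=2, ensure_ascii=False)[:1000])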