Named Entity Recognition (NER) API

API for performing named entity recognition from text input in Finnish. The model was trained by fine-tuning a Finnish BERT language model to recognize 10 named entity categories:

PERSON (person names)
ORG (organizations)
LOC (locations)
GPE (geopolitical locations)
PRODUCT (products)
EVENT (events)
DATE (dates)
JON (Finnish journal numbers, diaarinumero)
FIBC (Finnish business identity codes, y-tunnus)
NORP (nationality, religious and political groups)

Model training and testing

The code used for training the model is available here. More information on the training data, model parameters and test results is available at the HuggingFace page hosting the model.

Running the API

The API code has been built using the FastAPI library. It can be run either in a virtual environment, or in a Docker container. Instructions for both options are given below.

The API downloads the latest versions of the model files from HuggingFace when the code is run. By default, the files are saved to ~/.cache/huggingface/hub/.

This path can be changed by exporting the environment variable TRANSFORMERS_CACHE. For example, in a bash shell, run export TRANSFORMERS_CACHE=/path/to/cache before starting the API.
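The precedence described above (environment variable first, default path otherwise) can be sketched in Python. This only illustrates the lookup order. The helper name resolve_cache_dir is hypothetical and is not part of the API code or of the transformers library.

```python
import os
from pathlib import Path

# Default location named in this README.
DEFAULT_CACHE = Path("~/.cache/huggingface/hub").expanduser()


def resolve_cache_dir(environ=os.environ):
    """Return the directory where the model files are expected to be cached."""
    override = environ.get("TRANSFORMERS_CACHE")
    # An exported TRANSFORMERS_CACHE takes priority over the default path.
    return Path(override).expanduser() if override else DEFAULT_CACHE


print(resolve_cache_dir())
```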

Output format

The model makes predictions for named entities in the IOB2-format, where the B-prefix is used for the first token of an entity, and I-prefix for all subsequent tokens belonging to the same entity.

Different aggregation strategies can be used for changing the model output format. These can be changed with the parameter AGGREGATION_STRATEGY when starting the API. For example

AGGREGATION_STRATEGY="simple" uvicorn api:app

Aggregation strategy: 'none'

By default, model output follows the input format, which is based on wordpiece tokenization. Therefore, when the aggregation strategy 'none' is used, the input sentence 'Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812.' produces the output:

[{'entity': 'B-GPE', 'score': 0.9999044, 'index': 1, 'word': 'Helsingistä', 'start': 0, 'end': 11}, {'entity': 'B-GPE', 'score': 0.9991748, 'index': 3, 'word': 'Suomen', 'start': 17, 'end': 23}, {'entity': 'I-GPE', 'score': 0.9968881, 'index': 4, 'word': 'suuri', 'start': 24, 'end': 29}, {'entity': 'I-GPE', 'score': 0.9972023, 'index': 5, 'word': '##ru', 'start': 29, 'end': 31}, {'entity': 'I-GPE', 'score': 0.99688524, 'index': 6, 'word': '##htina', 'start': 31, 'end': 36}, {'entity': 'I-GPE', 'score': 0.99559337, 'index': 7, 'word': '##sku', 'start': 36, 'end': 39}, {'entity': 'I-GPE', 'score': 0.99525815, 'index': 8, 'word': '##nna', 'start': 39, 'end': 42}, {'entity': 'I-GPE', 'score': 0.99037445, 'index': 9, 'word': '##n', 'start': 42, 'end': 43}, {'entity': 'B-DATE', 'score': 0.999951, 'index': 11, 'word': 'vuonna', 'start': 56, 'end': 62}, {'entity': 'I-DATE', 'score': 0.9998229, 'index': 12, 'word': '18', 'start': 63, 'end': 65}, {'entity': 'I-DATE', 'score': 0.9999138, 'index': 13, 'word': '##12', 'start': 65, 'end': 67}]

This is a list of dictionaries, where each dictionary contains the following keys and values:

entity: Defines the predicted entity group of the token, using the IOB2 schema.
score: Confidence score that the model gives to the prediction.
index: Index of the token in the tokenized text input.
word: Token / wordpiece for which the prediction is made. In the above example, for instance the word 'suuriruhtinaskunnan' is split into six wordpieces, where the pieces following the first one begin with '##'.
start: Character index in the input text where the token/wordpiece starts.
end: Character index in the input text where the token/wordpiece ends (exclusive, so 'Helsingistä' spans 0 to 11).
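Judging from the example output, start and end are character offsets into the input string, with end being exclusive. A minimal sketch of how a client could use them to recover the surface text of each wordpiece without the '##' marker is given below. The helper name surface_form is made up for this illustration.

```python
text = "Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812."

# A few entries copied from the example output above (scores omitted).
predictions = [
    {"entity": "B-GPE", "word": "Helsingistä", "start": 0, "end": 11},
    {"entity": "I-GPE", "word": "suuri", "start": 24, "end": 29},
    {"entity": "I-GPE", "word": "##ru", "start": 29, "end": 31},
    {"entity": "I-DATE", "word": "##12", "start": 65, "end": 67},
]


def surface_form(text, prediction):
    """Slice the original input with the character offsets of one prediction."""
    return text[prediction["start"]:prediction["end"]]


for p in predictions:
    print(p["entity"], repr(surface_form(text, p)))
```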

Aggregation strategy: 'simple'

This aggregation strategy groups together the B- and I-parts of the same entities into a single entity. Now the output for the example sentence becomes:

[{'entity_group': 'GPE', 'score': 0.9999044, 'word': 'Helsingistä', 'start': 0, 'end': 11}, {'entity_group': 'GPE', 'score': 0.995911, 'word': 'Suomen suuriruhtinaskunnan', 'start': 17, 'end': 43}, {'entity_group': 'DATE', 'score': 0.9998959, 'word': 'vuonna 1812', 'start': 56, 'end': 67}]

Now for example the word 'suuriruhtinaskunnan' is one token belonging to entity group 'GPE'. Token/wordpiece index is omitted from the results. More information on the 'simple' strategy and its variations ('first', 'average', 'max') can be found here. By default, the 'first' strategy is used in the API.
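The grouping idea can be illustrated with a short Python sketch that works on the token-level predictions of the example sentence. This is not the transformers implementation, only an approximation under two assumptions: an I- token continues a group when it directly follows it and has the same label, and the group score is the mean of the token scores (which agrees with the numbers in the example output). The function name group_entities is hypothetical.

```python
from statistics import mean

text = "Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812."

# (entity, score, index, start, end) copied from the token-level example output.
rows = [
    ("B-GPE", 0.9999044, 1, 0, 11),
    ("B-GPE", 0.9991748, 3, 17, 23),
    ("I-GPE", 0.9968881, 4, 24, 29),
    ("I-GPE", 0.9972023, 5, 29, 31),
    ("I-GPE", 0.99688524, 6, 31, 36),
    ("I-GPE", 0.99559337, 7, 36, 39),
    ("I-GPE", 0.99525815, 8, 39, 42),
    ("I-GPE", 0.99037445, 9, 42, 43),
    ("B-DATE", 0.999951, 11, 56, 62),
    ("I-DATE", 0.9998229, 12, 63, 65),
    ("I-DATE", 0.9999138, 13, 65, 67),
]
tokens = [dict(zip(("entity", "score", "index", "start", "end"), r)) for r in rows]


def group_entities(text, tokens):
    """Merge consecutive B-/I- tokens of one label into a single entity group."""
    groups = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        last = groups[-1] if groups else None
        # Assumption: only an adjacent I- token with the same label continues a group.
        continues = (
            prefix == "I"
            and last is not None
            and last["label"] == label
            and last["last_index"] + 1 == tok["index"]
        )
        if continues:
            last["scores"].append(tok["score"])
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            groups.append({"label": label, "scores": [tok["score"]],
                           "start": tok["start"], "end": tok["end"],
                           "last_index": tok["index"]})
    return [
        {"entity_group": g["label"], "score": mean(g["scores"]),
         "word": text[g["start"]:g["end"]], "start": g["start"], "end": g["end"]}
        for g in groups
    ]


for entity in group_entities(text, tokens):
    print(entity)
```

The word of each group is sliced from the original text with the start and end offsets, which avoids having to reassemble it from wordpieces.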

Aggregation strategy: 'custom'

This aggregation option is custom built and is not part of the transformers library. The goal is to group together the wordpieces belonging to a single B- or I-tag, so that the aggregation preserves the IOB2-style annotation format. The output for the example sentence is:

[{"entity_group":"B-GPE","score":0.9999043941497803,"word":"Helsingistä","start":0,"end":11},{"entity_group":"B-GPE","score":0.9991747736930847,"word":"Suomen","start":17,"end":23},{"entity_group":"I-GPE","score":0.9953669706980387,"word":"suuriruhtinaskunnan","start":24,"end":43},{"entity_group":"B-DATE","score":0.9999510049819946,"word":"vuonna","start":56,"end":62},{"entity_group":"I-DATE","score":0.9998683929443359,"word":"1812","start":63,"end":67}]
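A rough Python sketch of this kind of wordpiece merging is shown below. It is not the code used in api.py. It assumes that continuation pieces are marked with '##' and that the score of a merged word is the mean of its piece scores, which is consistent with the example numbers but not confirmed here. The function name merge_wordpieces is hypothetical.

```python
from statistics import mean

# Token-level predictions for 'Suomen suuriruhtinaskunnan' from the example sentence.
tokens = [
    {"entity": "B-GPE", "score": 0.9991748, "word": "Suomen", "start": 17, "end": 23},
    {"entity": "I-GPE", "score": 0.9968881, "word": "suuri", "start": 24, "end": 29},
    {"entity": "I-GPE", "score": 0.9972023, "word": "##ru", "start": 29, "end": 31},
    {"entity": "I-GPE", "score": 0.99688524, "word": "##htina", "start": 31, "end": 36},
    {"entity": "I-GPE", "score": 0.99559337, "word": "##sku", "start": 36, "end": 39},
    {"entity": "I-GPE", "score": 0.99525815, "word": "##nna", "start": 39, "end": 42},
    {"entity": "I-GPE", "score": 0.99037445, "word": "##n", "start": 42, "end": 43},
]


def merge_wordpieces(tokens):
    """Join '##' continuation pieces onto the preceding piece, keeping its IOB2 tag."""
    words = []
    for tok in tokens:
        if tok["word"].startswith("##") and words:
            current = words[-1]
            current["word"] += tok["word"][2:]  # drop the '##' marker
            current["scores"].append(tok["score"])
            current["end"] = tok["end"]
        else:
            # A piece without '##' starts a new word and fixes its tag.
            words.append({"entity_group": tok["entity"], "word": tok["word"],
                          "scores": [tok["score"]],
                          "start": tok["start"], "end": tok["end"]})
    return [
        {"entity_group": w["entity_group"], "score": mean(w["scores"]),
         "word": w["word"], "start": w["start"], "end": w["end"]}
        for w in words
    ]


for word in merge_wordpieces(tokens):
    print(word)
```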

Running the API in a virtual environment

These instructions use a conda virtual environment, and as a precondition you should have Miniconda or Anaconda installed on your operating system. More information on the installation is available here.

Create and activate a conda environment using the following commands:

conda create -n ner_api_env python=3.7

conda activate ner_api_env

Install dependencies listed in the requirements.txt file:

pip install -r requirements.txt

Start the API running a single process (with Uvicorn server):

Using default host: 0.0.0.0, default port: 8000

uvicorn api:app

Select different host / port:

uvicorn api:app --host 0.0.0.0 --port 8080

You can also start the API with Gunicorn as the process manager (find more information here) (NB! does not work on Windows):

gunicorn api:app --workers 2 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8080

workers: The number of worker processes to use, each will run a Uvicorn worker
worker-class: The Gunicorn-compatible worker class to use in the worker processes
bind: This tells Gunicorn the IP and the port to listen to, using a colon (:) to separate the IP and the port
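As an alternative to command-line flags, Gunicorn can read the same settings from a configuration file, which is a plain Python file. The sketch below mirrors the flags above. The file is hypothetical and is not included in the repository.

```python
# gunicorn.conf.py (hypothetical file, not part of the repository)

# Number of worker processes, each running a Uvicorn worker.
workers = 2

# Gunicorn-compatible worker class used in the worker processes.
worker_class = "uvicorn.workers.UvicornWorker"

# IP and port to listen to, separated by a colon.
bind = "0.0.0.0:8080"
```

With this file in place the API could be started with gunicorn -c gunicorn.conf.py api:app.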

Running the API using Docker

As a precondition, you should have Docker Engine installed. More information on the installation can be found here.

Build Docker image using the Dockerfile included in the repository:

sudo docker build -t ner_image .

Here the new image is named ner_image. After successfully creating the image, you can find it in the list of images by typing docker image ls.

Create and run a container based on the image:

sudo docker run -d --name ner_container -p 8000:8000 ner_image

In the Dockerfile, port 8000 is exposed, meaning that the container listens to that port. In the above command, the corresponding host port can be chosen as the first element in -p <host-port>:<container-port>. If only the container port is specified, Docker will automatically select a free port as the host port. The port mapping of the container can be viewed with the command sudo docker port ner_container.

If you want to change the default aggregation strategy ('simple') when creating the container, this can be done by using the -e flag:

sudo docker run -d --name ner_container -p 8000:8000 -e AGGREGATION_STRATEGY="custom" ner_image

Logging

Logging events are saved into a file api_log.log in the same folder where the api.py file is located. Previous content of the log file is overwritten after each restart. More information on different logging options is available here.
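The overwrite-on-restart behaviour corresponds to opening the log file in write mode. The following sketch shows one way to get that behaviour with the standard logging module. It is an illustration only, not the configuration used in api.py, and the function and logger names are made up.

```python
import logging
from pathlib import Path


def make_file_logger(log_path, name="ner_api_sketch"):
    """Return a logger that writes to log_path, truncating the file first."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Remove handlers from an earlier call so messages are not written twice.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    # mode="w" discards the previous content each time the handler is created.
    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# Example: place the log file next to the current working directory.
logger = make_file_logger(Path.cwd() / "api_log.log")
logger.info("API started")
```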

Testing the API

The API has one endpoint, /ner, which expects the input text to be included in the client's POST request.

Input format

The input text is expected to be in JSON format, where the key 'text' is used for defining the content:

'{"text": "Example text in Finnish."}'

Testing the API in a virtual environment

You can test the API for example using curl:

curl -d '{"text": "Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812."}' -H "Content-Type: application/json" -X POST http://127.0.0.1:8000/ner

NB! Windows users might encounter the following error: Invoke-WebRequest : A parameter cannot be found that matches parameter name 'F'. This can be bypassed by running the command Remove-item alias:curl.

The host and port should be the same ones that were defined when starting the API.
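The same request can also be sent from Python. The sketch below uses only the standard library and mirrors the curl call above. The function names are hypothetical, and it is assumed that the API responds with JSON in the output format described earlier.

```python
import json
import urllib.request


def build_ner_request(text, host="127.0.0.1", port=8000):
    """Prepare a POST request for the /ner endpoint with a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/ner",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_ner(text, host="127.0.0.1", port=8000, timeout=30):
    """Send the request and decode the JSON response (requires a running API)."""
    request = build_ner_request(text, host, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running, query_ner("Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812.") would return the predictions for the example sentence. Pass host and port if they differ from the defaults.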

Testing the API using Docker

The Docker version of the API can be tested (when the container is running) for example with curl, using the same arguments as above.
