gr-nlp-toolkit

A Transformer-based natural language processing toolkit for (modern) Greek. The toolkit achieves state-of-the-art performance for Greek and supports named entity recognition, part-of-speech tagging, morphological tagging, and dependency parsing. For more information, please consult the following theses:

C. Dikonimaki, "A Transformer-based natural language processing toolkit for Greek -- Part of speech tagging and dependency parsing", BSc thesis, Department of Informatics, Athens University of Economics and Business, 2021. http://nlp.cs.aueb.gr/theses/dikonimaki_bsc_thesis.pdf

N. Smyrnioudis, "A Transformer-based natural language processing toolkit for Greek -- Named entity recognition and multi-task learning", BSc thesis, Department of Informatics, Athens University of Economics and Business, 2021. http://nlp.cs.aueb.gr/theses/smyrnioudis_bsc_thesis.pdf

Installation

You can install the toolkit by executing the following in the command line:

pip install gr-nlp-toolkit

Usage

To use the toolkit, first initialize a Pipeline, specifying which processors you need. Each processor adds the annotations of one specific task to the text.

  • To obtain Part-of-Speech and Morphological Tagging annotations, add the pos processor
  • To obtain Named Entity Recognition annotations, add the ner processor
  • To obtain Dependency Parsing annotations, add the dp processor

from gr_nlp_toolkit import Pipeline
nlp = Pipeline("pos,ner,dp") # Use ner,pos,dp processors
# nlp = Pipeline("ner,dp") # Use only ner and dp processors
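
Since the processors are passed as a single comma-separated string, it can be convenient to build that string from a list and catch typos before any model is loaded. The following is a minimal sketch; build_processor_spec and KNOWN_PROCESSORS are hypothetical names that are not part of the toolkit, and the sketch makes no claim about how the toolkit itself reacts to unknown processor names.

```python
# Hypothetical helper, not part of gr-nlp-toolkit.
KNOWN_PROCESSORS = ("pos", "ner", "dp")  # the three processors listed above


def build_processor_spec(names):
    """Join processor names into the comma-separated string passed to Pipeline."""
    cleaned = [name.strip().lower() for name in names]
    unknown = [name for name in cleaned if name not in KNOWN_PROCESSORS]
    if unknown:
        raise ValueError(f"Unknown processor(s): {', '.join(unknown)}")
    # Drop duplicates while keeping the caller's order.
    return ",".join(dict.fromkeys(cleaned))


spec = build_processor_spec(["ner", "dp"])
print(spec)
# nlp = Pipeline(spec)  # requires gr-nlp-toolkit to be installed
```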

The first time you use a processor, its data files are downloaded and cached in the .cache folder of your home directory, so that you will not have to download them again. Each processor is about 500 MB in size, so using all three processors requires a download of up to 1.5 GB.

Generating the annotations

After creating the pipeline, you can annotate a text by calling the pipeline object directly (this invokes its __call__ method).

doc = nlp('Η Ιταλία κέρδισε την Αγγλία στον τελικό του Euro 2020')

A Document object is then created and annotated. The original text is split into tokens, which are available through doc.tokens.

Accessing the annotations

The following code shows how you can access the annotations generated by the toolkit.

for token in doc.tokens:
  print(token.text) # the text of the token
  
  print(token.ner) # the named entity label in IOBES encoding : str
  
  print(token.upos) # the UPOS tag of the token
  print(token.feats) # the morphological features for the token
  
  print(token.head) # the head of the token
  print(token.deprel) # the dependency relation between the current token and its head

token.ner is set by the ner processor, token.upos and token.feats are set by the pos processor, and token.head and token.deprel are set by the dp processor.
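
Because token.ner holds one IOBES tag per token, multi-token entities have to be reassembled by the caller. The following is a minimal sketch of that grouping. It assumes the tags have the common form PREFIX-LABEL (for example B-ORG) with O for tokens outside any entity; group_entities is a hypothetical helper that is not part of the toolkit, and the tags in the example are hand-written for illustration, not actual toolkit output.

```python
def group_entities(tokens_and_tags):
    """Merge (text, tag) pairs in IOBES encoding into (entity_text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for text, tag in tokens_and_tags:
        prefix, _, label = tag.partition("-")
        if prefix == "S":            # single-token entity
            entities.append((text, label))
            current_words, current_label = [], None
        elif prefix == "B":          # first token of a multi-token entity
            current_words, current_label = [text], label
        elif prefix in ("I", "E") and current_label == label:
            current_words.append(text)
            if prefix == "E":        # last token closes the entity
                entities.append((" ".join(current_words), label))
                current_words, current_label = [], None
        else:                        # "O" or an inconsistent tag sequence
            current_words, current_label = [], None
    return entities


# Hand-written tags for illustration only (labels are assumptions).
pairs = [("Η", "O"), ("Ιταλία", "S-LOC"), ("κέρδισε", "O"),
         ("Euro", "B-EVENT"), ("2020", "E-EVENT")]
print(group_entities(pairs))
```

With a real Document, the same function could be called as group_entities((t.text, t.ner) for t in doc.tokens), provided the tag format matches the assumption above.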

Note that to get the Token object that is the head of another token, you need to access doc.tokens[head-1]. The reason is that token numbering starts from 1, while Python lists are indexed from 0. When token.head is 0, the token has no head: it is the root of the sentence's dependency tree.
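
The index arithmetic described above can be wrapped in a small function so that the root case is handled in one place. This is a minimal sketch: head_of and StubToken are hypothetical names used only so the example runs without downloading any models, and the head and deprel values are hand-written rather than produced by the toolkit.

```python
from dataclasses import dataclass


@dataclass
class StubToken:
    """Stand-in for the toolkit's Token, with only the fields used here."""
    text: str
    head: int    # 1-based index of the head token, 0 for the root
    deprel: str


def head_of(tokens, token):
    """Return the head Token of `token`, or None if `token` is the root."""
    if token.head == 0:
        return None
    return tokens[token.head - 1]  # heads are 1-based, lists are 0-based


# Hand-written annotations for illustration only.
tokens = [
    StubToken("Η", 2, "det"),
    StubToken("Ιταλία", 3, "nsubj"),
    StubToken("κέρδισε", 0, "root"),
]
for tok in tokens:
    head = head_of(tokens, tok)
    print(tok.text, "<-", head.text if head else "ROOT")
```

With a real Document, the call would be head_of(doc.tokens, token).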

Alternative download methods for the toolkit models

Currently, the models are served from a Google Drive folder. In case they become unavailable from that source, the models can be found via archive.org at the following links:

The toolkit currently cannot download the models from these sources. If you have obtained the models from an alternative source, place the files in the .cache/gr_nlp_toolkit/ directory of your home folder (~/.cache/gr_nlp_toolkit on Linux systems). Be sure to name the Dependency Parsing model file toolkit-dp, the Named Entity Recognition model file toolkit-ner, and the Part-of-Speech and morphological tagging model file toolkit-pos. The toolkit will then use the local files instead of downloading the models from the Internet.
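
Before building a pipeline offline, it may help to check that the manually placed files are where the toolkit expects them. The following is a minimal sketch that relies only on the directory and file names given above; missing_models and MODEL_FILES are hypothetical names that are not part of the toolkit, and the check only tests that each file exists, not that its contents are valid.

```python
from pathlib import Path

# File names as described above, keyed by processor name.
MODEL_FILES = {"dp": "toolkit-dp", "ner": "toolkit-ner", "pos": "toolkit-pos"}


def missing_models(processors, cache_dir=None):
    """Return the processors whose model file is not present in cache_dir."""
    if cache_dir is None:
        cache_dir = Path.home() / ".cache" / "gr_nlp_toolkit"
    cache_dir = Path(cache_dir)
    return [p for p in processors if not (cache_dir / MODEL_FILES[p]).is_file()]


if __name__ == "__main__":
    # An empty list means every requested model file is already in place.
    print(missing_models(["pos", "ner", "dp"]))
```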