gr-nlp-toolkit

A Transformer-based natural language processing toolkit for (modern) Greek. The toolkit has state-of-the-art performance in Greek and supports named entity recognition, part-of-speech tagging, morphological tagging, and dependency parsing. For more information, please consult the following theses:

C. Dikonimaki, "A Transformer-based natural language processing toolkit for Greek -- Part of speech tagging and dependency parsing", BSc thesis, Department of Informatics, Athens University of Economics and Business, 2021. http://nlp.cs.aueb.gr/theses/dikonimaki_bsc_thesis.pdf

N. Smyrnioudis, "A Transformer-based natural language processing toolkit for Greek -- Named entity recognition and multi-task learning", BSc thesis, Department of Informatics, Athens University of Economics and Business, 2021. http://nlp.cs.aueb.gr/theses/smyrnioudis_bsc_thesis.pdf

Installation

You can install the toolkit by executing the following in the command line:

pip install gr-nlp-toolkit

Usage

To use the toolkit, first initialize a Pipeline, specifying which processors you need. Each processor annotates the text with the annotations of one specific task.

To obtain Part-of-Speech and Morphological Tagging annotations, add the pos processor.
To obtain Named Entity Recognition annotations, add the ner processor.
To obtain Dependency Parsing annotations, add the dp processor.

from gr_nlp_toolkit import Pipeline
nlp = Pipeline("pos,ner,dp") # Use the pos, ner and dp processors
# nlp = Pipeline("ner,dp") # Use only the ner and dp processors

The first time you use a processor, its data files are downloaded and cached in the .cache folder of your home directory, so that you will not have to download them again. Each processor is about 500 MB in size, so using all three processors requires downloading up to about 1.5 GB.

Generating the annotations

After creating the pipeline, you can annotate a text by calling the pipeline object directly (this invokes its __call__ method).

doc = nlp('Η Ιταλία κέρδισε την Αγγλία στον τελικό του Euro 2020')

A Document object is then created and annotated. The original text is split into tokens.

Accessing the annotations

The following code shows how you can access the annotations generated by the toolkit.

for token in doc.tokens:
  print(token.text) # the text of the token
  
  print(token.ner) # the named entity label in IOBES encoding : str
  
  print(token.upos) # the UPOS tag of the token
  print(token.feats) # the morphological features for the token
  
  print(token.head) # the head of the token
  print(token.deprel) # the dependency relation between the current token and its head

token.ner is set by the ner processor, token.upos and token.feats are set by the pos processor and token.head and token.deprel are set by the dp processor.
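
Since token.ner holds one IOBES label per token, it can be convenient to group the labels into entity spans. The following is a minimal sketch, not part of the toolkit: the helper name iobes_to_spans is hypothetical, and it assumes the labels are strings of the form "B-TYPE", "I-TYPE", "E-TYPE", "S-TYPE" and "O". The exact label strings and entity types produced by the ner processor may differ, so check them against your own output first. The example tag sequence is made up for illustration.

```python
from typing import List, Tuple


def iobes_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group IOBES tags into (start, end_exclusive, entity_type) spans.

    Assumes tags look like "B-ORG", "I-ORG", "E-ORG", "S-ORG" or "O".
    """
    spans = []
    start = None    # index where the currently open entity began
    current = None  # entity type of the currently open entity
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S" and etype:
            # Single-token entity
            spans.append((i, i + 1, etype))
            start, current = None, None
        elif prefix == "B" and etype:
            # Beginning of a multi-token entity
            start, current = i, etype
        elif prefix == "I" and start is not None and etype == current:
            # Inside the open entity: nothing to record yet
            continue
        elif prefix == "E" and start is not None and etype == current:
            # End of the open entity
            spans.append((start, i + 1, etype))
            start, current = None, None
        else:
            # "O" or an inconsistent sequence: discard any open entity
            start, current = None, None
    return spans


# Made-up tag sequence for illustration (not actual toolkit output)
example_tags = ["O", "S-GPE", "O", "O", "S-GPE", "O", "O", "O", "B-EVENT", "E-EVENT"]
print(iobes_to_spans(example_tags))
```

With an annotated document, the same helper could be called as iobes_to_spans([token.ner for token in doc.tokens]), and the resulting indices can be used to slice doc.tokens.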

A small detail: to get the Token object that is the head of another token, you need to access doc.tokens[token.head - 1]. The reason is that token numbering starts from 1, and token.head being 0 means that the token is the root of the sentence's dependency tree.
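
The head lookup can be wrapped in a small helper. This is only a sketch under the assumptions stated above (token.head is an integer, numbering starts at 1, and 0 marks the root); the function name head_token is hypothetical and not part of the toolkit. Stand-in token objects with made-up head values are used so that the snippet runs without downloading any models.

```python
from types import SimpleNamespace


def head_token(tokens, token):
    """Return the head of `token`, or None if `token` is the root (head == 0)."""
    if token.head == 0:
        return None
    # Heads are 1-based, Python lists are 0-based
    return tokens[token.head - 1]


# Stand-in tokens with made-up head values (not actual toolkit output)
tokens = [
    SimpleNamespace(text="Η", head=2),
    SimpleNamespace(text="Ιταλία", head=3),
    SimpleNamespace(text="κέρδισε", head=0),
]

for tok in tokens:
    head = head_token(tokens, tok)
    print(tok.text, "->", head.text if head is not None else "ROOT")
```

With a real annotated document, the call would be head_token(doc.tokens, token).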

Alternative download methods for the toolkit models

Currently, the models are served from a Google Drive folder. In case they become unavailable from that source, the models can be found via archive.org at the following links:

Dependency Parsing model: https://archive.org/details/toolkit-dp
Named Entity Recognition model: https://archive.org/details/toolkit-ner
Part-of-Speech and morphological tagging model: https://archive.org/details/toolkit-pos

The toolkit currently cannot download the models from these sources itself. However, if you have downloaded the models from an alternative source, you can place the files in the .cache/gr_nlp_toolkit/ directory of your home folder (~/.cache/gr_nlp_toolkit on Linux systems). Be sure to name the Dependency Parsing model file toolkit-dp, the Named Entity Recognition model file toolkit-ner, and the Part-of-Speech and morphological tagging model file toolkit-pos. This way, the toolkit will not download any models from the Internet and will use the local ones instead.
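
Before creating a Pipeline, you may want to check that manually placed files are where the toolkit expects them. The sketch below only inspects the file system, using the directory and file names given above; the helper name missing_models is hypothetical and not part of the toolkit, and it does not verify that the files are valid models.

```python
from pathlib import Path

# File names expected in the cache directory, as described above
MODEL_FILES = ("toolkit-dp", "toolkit-ner", "toolkit-pos")


def missing_models(cache_dir: Path) -> list:
    """Return the expected model file names that are not present in cache_dir."""
    return [name for name in MODEL_FILES if not (cache_dir / name).is_file()]


cache_dir = Path.home() / ".cache" / "gr_nlp_toolkit"
print("Missing model files:", missing_models(cache_dir))
```

An empty list means all three files are present under the expected names.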
