GitHub - EMBEDDIA/lemmagen3: A Python2/3 wrapper for the Lemmagen lemmatizer supporting 19 languages

About

lemmagen3 is a Python 2/3 wrapper for the Lemmagen lemmatizer (version 2.2).

It differs from other Lemmagen wrappers, such as the one available on PyPI, in that it offers a clean, fast object-oriented interface built with the excellent pybind11 library and supports an additional language (Croatian).

Models for Slovene, Croatian and Serbian have been significantly updated and use frequency data to prefer the most frequent lemma, e.g., for Slovene: je->biti instead of je->jesti, mene->jaz instead of mene->mena, od->od instead of od->oda, etc.

In total, 19 languages are supported:

Bulgarian: bg
Croatian: hr
Czech: cs
English: en
Estonian: et
Farsi/Persian: fa
French: fr
German: de
Hungarian: hu
Italian: it
Macedonian: mk
Polish: pl
Romanian: ro
Russian: ru
Serbian: sr
Slovak: sk
Slovene: sl
Spanish: es
Ukrainian: uk
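
If you prefer to look up a code by language name, the list above can be written as a small mapping. This is only a sketch: `SUPPORTED_LANGUAGES` and `language_code` are hypothetical names that are not part of lemmagen3, and "Farsi/Persian" is split into two keys here for convenience.

```python
# Language names and ISO 639-1 codes, copied from the list above.
# "Farsi" and "Persian" both map to "fa", so there are 20 names for 19 codes.
SUPPORTED_LANGUAGES = {
    "Bulgarian": "bg", "Croatian": "hr", "Czech": "cs", "English": "en",
    "Estonian": "et", "Farsi": "fa", "Persian": "fa", "French": "fr",
    "German": "de", "Hungarian": "hu", "Italian": "it", "Macedonian": "mk",
    "Polish": "pl", "Romanian": "ro", "Russian": "ru", "Serbian": "sr",
    "Slovak": "sk", "Slovene": "sl", "Spanish": "es", "Ukrainian": "uk",
}


def language_code(name):
    """Return the code for a language name, ignoring case (hypothetical helper)."""
    lookup = {key.lower(): code for key, code in SUPPORTED_LANGUAGES.items()}
    try:
        return lookup[name.strip().lower()]
    except KeyError:
        raise ValueError("unsupported language: %r" % (name,))
```

The returned code can then be passed to the `Lemmatizer` constructor described in the "How to use" section.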

Installation and requirements

pip install lemmagen3

This installs the module and the language model files. Please note that on Python <=3.5 and Python 2.7 the package will be built from source, so you will need a C++ compiler.

Note: If you use Python 3.5.0 or 3.5.1, you will likely get the error shown below. This is a known bug in these two versions, so please consider upgrading your Python.

ImportError: ..._lemmagen.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _PyThreadState_UncheckedGet
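
To find out in advance whether an interpreter is one of the two affected releases, you can check its version. The following is a minimal sketch; `AFFECTED_VERSIONS` and `is_affected` are hypothetical names that are not part of lemmagen3.

```python
import sys

# The two Python releases named above as producing the
# "undefined symbol: _PyThreadState_UncheckedGet" ImportError.
AFFECTED_VERSIONS = {(3, 5, 0), (3, 5, 1)}


def is_affected(version_info=None):
    """Return True if the given (or the running) interpreter is 3.5.0 or 3.5.1."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor, micro); ignore release level and serial.
    return tuple(version_info[:3]) in AFFECTED_VERSIONS


if is_affected():
    sys.stderr.write("This Python version is affected; consider upgrading.\n")
```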

How to use

The following snippet illustrates how to use lemmagen3.

from lemmagen3 import Lemmatizer

# first, we can list all supported languages
print(Lemmatizer.list_supported_languages())

# then, create a few lemmatizer objects using ISO 639-1 language codes
# (English, Slovene and Russian)

lem_en = Lemmatizer('en')
lem_sl = Lemmatizer('sl')
lem_ru = Lemmatizer('ru')

# now lemmatize the word "cats" in all three languages
print(lem_en.lemmatize('cats'))
print(lem_sl.lemmatize('mačke'))
print(lem_ru.lemmatize('коты'))

# you can also change the language for an existing Lemmatizer object
# lem_en will now become a French lemmatizer:
lem_en.load_language('fr')

# finally, you can also load your own Lemmagen model
my_lem = Lemmatizer()
my_lem.load_model('/path/to/my/model')

Note that the function lemmatize accepts single string tokens and does not split the input string! If you want to lemmatize a chunk of text you will have to tokenize it first, e.g.:

sentence = 'cats hate dogs'
tokens = sentence.split()
sentence_lemmatized = ' '.join([lem_en.lemmatize(token) for token in tokens])
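
Splitting on whitespace leaves punctuation attached to tokens (e.g. 'dogs.'). One possible alternative, shown here only as a sketch, is to lemmatize each run of word characters and leave everything else untouched. `lemmatize_text` and `WORD` are hypothetical names, not part of lemmagen3; the function accepts any callable that maps a token to its lemma.

```python
import re

# Runs of word characters (letters, digits, underscore). re.UNICODE makes
# \w match non-ASCII letters on Python 2 as well; on Python 3 it is the default.
WORD = re.compile(r"\w+", flags=re.UNICODE)


def lemmatize_text(text, lemmatize):
    """Replace every word in `text` with lemmatize(word); keep punctuation and spacing."""
    return WORD.sub(lambda match: lemmatize(match.group(0)), text)
```

With the English lemmatizer created earlier, this could be called as `lemmatize_text('cats hate dogs.', lem_en.lemmatize)`. Note that `\w+` also splits on apostrophes and hyphens, so a proper tokenizer may be a better choice for real text.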

Note also that lemmagen3 operates on Unicode strings, so if you use Python 2 make sure that your input is a unicode object rather than a byte string.
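
A small helper can normalize input before it is passed to lemmatize. This is a sketch that assumes byte strings are UTF-8 encoded; `to_text` is a hypothetical name, not part of lemmagen3.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a Unicode string, decoding byte strings with `encoding`."""
    # On Python 2, `bytes` is an alias for `str`, so plain byte strings are decoded;
    # on Python 3, only real `bytes` objects are decoded and `str` passes through.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

For example, `lem_sl.lemmatize(to_text(token))` would work for both byte strings and Unicode strings, provided the assumed encoding is correct.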

License

Please note that this repository contains code and binary models compiled and built from different sources which are under different licenses:

C++ files and headers come from Lemmagen and are modified and adapted to work as a Python module (LGPL)
Binary models are built from Multext and Multext-east sources:
- Language resources used to build Farsi/Persian, Macedonian, Polish, and Russian models are for non-commercial use only.
- Language resources for the other supported languages are released under CC BY-SA 4.0.

The rest of the code in this repository was created by the author and is licensed under the MIT license.

Authors

lemmagen3 is developed by Vid Podpečan (vid.podpecan@ijs.si).
The Lemmagen lemmatizer was developed by Matjaž Juršič.
