A way to tell what tokens `LatinBackOffLemmatizer()` has failed to lemmatize #1194
Comments
Hello @langeslag,

Fortunately, you can combine the best of both worlds by creating your own custom lemmatizer. Tell me if you need help implementing a custom one.
Ok, I made something that might help you.

```python
import os
import re
from typing import List

from cltk.lemmatize.backoff import (
    DefaultLemmatizer,
    DictLemmatizer,
    IdentityLemmatizer,
    RegexpLemmatizer,
    UnigramLemmatizer,
)
from cltk.utils import CLTK_DATA_DIR
from cltk.utils.file_operations import open_pickle
from cltk.lemmatize.lat import *

models_path = os.path.normpath(
    os.path.join(CLTK_DATA_DIR, "lat/model/lat_models_cltk/lemmata/backoff")
)


class CustomLatinBackoffLemmatizer:
    """Suggested backoff chain; includes at least one of each
    type of major sequential backoff class from backoff.py.
    """

    def __init__(
        self, train: List[list] = None, seed: int = 3, verbose: bool = False
    ):
        self.models_path = models_path
        missing_models_message = (
            "LatinBackoffLemmatizer requires the `latin_models_cltk` corpus "
            "to be in cltk_data. Please load this corpus."
        )
        try:
            self.train = open_pickle(
                os.path.join(self.models_path, "latin_pos_lemmatized_sents.pickle")
            )
            self.LATIN_OLD_MODEL = open_pickle(
                os.path.join(self.models_path, "latin_lemmata_cltk.pickle")
            )
            self.LATIN_MODEL = open_pickle(
                os.path.join(self.models_path, "latin_model.pickle")
            )
        except FileNotFoundError as err:
            raise type(err)(missing_models_message)
        self.latin_sub_patterns = latin_sub_patterns  # Move to latin_models_cltk
        self.seed = seed
        self.VERBOSE = verbose

        def _randomize_data(train: List[list], seed: int):
            import random

            random.seed(seed)
            random.shuffle(train)
            train_size = int(0.9 * len(train))
            pos_train_sents = train[:train_size]
            lem_train_sents = [
                [(item[0], item[1]) for item in sent] for sent in train
            ]
            train_sents = lem_train_sents[:train_size]
            test_sents = lem_train_sents[train_size:]
            return pos_train_sents, train_sents, test_sents

        self.pos_train_sents, self.train_sents, self.test_sents = _randomize_data(
            self.train, self.seed
        )
        self._define_lemmatizer()

    def _define_lemmatizer(self):
        # Suggested backoff chain--should be tested for optimal order
        self.backoff1 = DefaultLemmatizer(verbose=self.VERBOSE)
        self.backoff2 = DictLemmatizer(
            lemmas=self.LATIN_OLD_MODEL,
            source="Morpheus Lemmas",
            backoff=self.backoff1,
            verbose=self.VERBOSE,
        )
        self.backoff3 = RegexpLemmatizer(
            self.latin_sub_patterns,
            source="CLTK Latin Regex Patterns",
            backoff=self.backoff2,
            verbose=self.VERBOSE,
        )
        self.backoff4 = UnigramLemmatizer(
            self.train_sents,
            source="CLTK Sentence Training Data",
            backoff=self.backoff3,
            verbose=self.VERBOSE,
        )
        self.backoff5 = DictLemmatizer(
            lemmas=self.LATIN_MODEL,
            source="Latin Model",
            backoff=self.backoff4,
            verbose=self.VERBOSE,
        )
        self.lemmatizer = self.backoff5

    def lemmatize(self, tokens: List[str]):
        return self.lemmatizer.lemmatize(tokens)

    def evaluate(self):
        if self.VERBOSE:
            raise AssertionError(
                "evaluate() method only works when verbose: bool = False"
            )
        return self.lemmatizer.evaluate(self.test_sents)

    def __repr__(self):
        return "<CustomLatinBackoffLemmatizer>"

    def __call__(self, token: str) -> str:
        return self.lemmatize([token])[0][0]
```

And you get:

```python
lemmatizer = CustomLatinBackoffLemmatizer()
list(lemmatizer.lemmatize('arma virumque cano euhhhh'.split()))
```
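Since this chain starts from `DefaultLemmatizer` (which, with no default lemma supplied, yields `None` rather than echoing the token), tokens the whole chain failed on can be picked out of the returned (token, lemma) pairs. A minimal sketch, using the sample output shown later in this thread:

```python
# Sample (token, lemma) pairs as returned by the custom lemmatizer; a lemma
# of None marks a token that every lemmatizer in the backoff chain failed on.
pairs = [("arma", "arma"), ("virumque", "vir"), ("cano", "cano"), ("euhhhh", None)]

failed = [token for token, lemma in pairs if lemma is None]
print(failed)  # ['euhhhh']
```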
That's immensely helpful @clemsciences, thanks a lot for putting in the extra effort! I'll happily work with your custom class myself, but am I mistaken in thinking that having this functionality available as part of the standard lemmatizer would be worthwhile? Either way, glad to see your solution.
What do you think of the suggestion @diyclassics? By the way, the suggested solution is not optimal at all; I could have made a child class of `LatinBackoffLemmatizer`:
```python
from cltk.lemmatize.backoff import (
    DefaultLemmatizer,
    DictLemmatizer,
    IdentityLemmatizer,
    RegexpLemmatizer,
    UnigramLemmatizer,
)
from cltk.lemmatize.lat import LatinBackoffLemmatizer


class CustomLatinBackoffLemmatizer(LatinBackoffLemmatizer):
    def _define_lemmatizer(self):
        # Suggested backoff chain--should be tested for optimal order
        self.backoff1 = DefaultLemmatizer(verbose=self.VERBOSE)
        self.backoff2 = DictLemmatizer(
            lemmas=self.LATIN_OLD_MODEL,
            source="Morpheus Lemmas",
            backoff=self.backoff1,
            verbose=self.VERBOSE,
        )
        self.backoff3 = RegexpLemmatizer(
            self.latin_sub_patterns,
            source="CLTK Latin Regex Patterns",
            backoff=self.backoff2,
            verbose=self.VERBOSE,
        )
        self.backoff4 = UnigramLemmatizer(
            self.train_sents,
            source="CLTK Sentence Training Data",
            backoff=self.backoff3,
            verbose=self.VERBOSE,
        )
        self.backoff5 = DictLemmatizer(
            lemmas=self.LATIN_MODEL,
            source="Latin Model",
            backoff=self.backoff4,
            verbose=self.VERBOSE,
        )
        self.lemmatizer = self.backoff5

    def __repr__(self):
        return "<CustomLatinBackoffLemmatizer>"
```

```python
>>> lemmatizer = CustomLatinBackoffLemmatizer()
>>> list(lemmatizer.lemmatize('arma virumque cano euhhhh'.split()))
[('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano'), ('euhhhh', None)]
```
@clemsciences Thanks for providing the custom class—happy to consider whatever features are useful to the community. The BackoffLemmatizer was always meant to be configurable, so anything that makes reconfiguration easier is welcome.

I might at this point prefer something like a nesting pipeline, i.e. if we think of "lemmatizer" as a pipeline component, this lemmatizer could in turn be thought of as having a pipeline itself—i.e. the backoff chain—from which subcomponents—i.e. individual lemmatizers—could be added/deleted/rearranged/etc. But it would be some time before I could commit to that kind of refactoring.
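As a rough illustration of that nesting-pipeline idea, here is a hypothetical sketch (not CLTK API; `BackoffPipeline`, `Stage`, and the toy models are invented here) in which the backoff chain is an ordered list of stages that can be added, removed, or rearranged before being run:

```python
from typing import Callable, List, Optional, Tuple

# Each stage maps a token to a lemma or None; the pipeline tries its stages
# in order and stops at the first one that returns a lemma.
Stage = Callable[[str], Optional[str]]


class BackoffPipeline:
    def __init__(self, stages: List[Stage] = None):
        self.stages: List[Stage] = list(stages or [])

    def add(self, stage: Stage, index: int = None) -> "BackoffPipeline":
        # Insert at a position (or append), so the chain can be rearranged.
        if index is None:
            self.stages.append(stage)
        else:
            self.stages.insert(index, stage)
        return self

    def remove(self, index: int) -> "BackoffPipeline":
        del self.stages[index]
        return self

    def lemmatize(self, tokens: List[str]) -> List[Tuple[str, Optional[str]]]:
        out = []
        for token in tokens:
            lemma = None
            for stage in self.stages:
                lemma = stage(token)
                if lemma is not None:
                    break
            out.append((token, lemma))
        return out


# Toy usage with two dictionary-backed stages standing in for real lemmatizers:
old_model = {"virumque": "vir"}
new_model = {"arma": "arma", "cano": "cano"}
pipeline = BackoffPipeline([new_model.get, old_model.get])
print(pipeline.lemmatize("arma virumque cano euhhhh".split()))
# [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano'), ('euhhhh', None)]
```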
In `LatinBackOffLemmatizer()` and the lemmatizers in its chain, I can't seem to find an option to return an empty value (such as `OldEnglishDictionaryLemmatizer()`'s `best_guess=False` option) instead of the input value when the lemmatizer fails to assign a lemma. Without such an option, it doesn't seem possible to tell successful from unsuccessful lemmatization attempts programmatically, which severely limits the range of the lemmatizer's applications.