
A way to tell what tokens LatinBackOffLemmatizer() has failed to lemmatize #1194

Open
langeslag opened this issue Dec 18, 2022 · 6 comments

@langeslag

In LatinBackOffLemmatizer() and the lemmatizers in its chain I can't seem to find an option to return an empty value (such as in OldEnglishDictionaryLemmatizer()'s best_guess=False option), instead of returning the input value, when the lemmatizer fails to assign a lemma.

Without such an option, it doesn't seem possible to tell successful from unsuccessful lemmatization attempts programmatically, severely limiting the range of the lemmatizer's applications.
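To make the problem concrete, here is a minimal sketch in plain Python (no CLTK dependency; the mini-model dict is invented) of why an identity-style fallback hides failures: an unknown token comes back as its own "lemma", indistinguishable from a genuinely correct identity lemma such as cano → cano.

```python
# Hypothetical mini-model standing in for the lemmatizer's lookup data.
KNOWN_LEMMAS = {"virumque": "vir", "cano": "cano"}

def identity_backoff_lemmatize(tokens):
    # Mimics a backoff chain whose last resort returns the token itself.
    return [(tok, KNOWN_LEMMAS.get(tok, tok)) for tok in tokens]

pairs = identity_backoff_lemmatize(["cano", "euhhhh"])
# Both pairs look "successful", even though 'euhhhh' was never found:
# [('cano', 'cano'), ('euhhhh', 'euhhhh')]
```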

@clemsciences
Member

Hello @langeslag,

The OldEnglishDictionaryLemmatizer class, based on DictionaryRegexLemmatizer, can signal a failed lemmatization with an empty list, whereas the LatinBackoffLemmatizer class, brilliantly implemented by @diyclassics, was deliberately designed never to return an empty value, i.e. there is always a default value.
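The difference between the two terminal behaviours can be sketched in plain Python (these toy functions only mirror the idea behind DefaultLemmatizer and IdentityLemmatizer; they are not the CLTK implementations):

```python
def default_lemmatize(tokens, default=None):
    # DefaultLemmatizer-style terminal step: a fixed sentinel,
    # so failures remain detectable downstream.
    return [(tok, default) for tok in tokens]

def identity_lemmatize(tokens):
    # IdentityLemmatizer-style terminal step: the token itself,
    # so failures are indistinguishable from identity lemmas.
    return [(tok, tok) for tok in tokens]

default_lemmatize(["euhhhh"])   # [('euhhhh', None)]
identity_lemmatize(["euhhhh"])  # [('euhhhh', 'euhhhh')]
```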

Fortunately, you can combine the best of both worlds by creating a custom lemmatizer.

Let me know if you need help implementing a custom LatinBackoffLemmatizer.

@clemsciences
Member

clemsciences commented Dec 19, 2022

OK, I've made something that might help you.

self.backoff1 is now a DefaultLemmatizer instance that returns None when no result is found.

import os
import re
from typing import List

from cltk.lemmatize.backoff import (
    DefaultLemmatizer,
    DictLemmatizer,
    RegexpLemmatizer,
    UnigramLemmatizer,
)
from cltk.utils import CLTK_DATA_DIR
from cltk.utils.file_operations import open_pickle
from cltk.lemmatize.lat import *

models_path = os.path.normpath(
    os.path.join(CLTK_DATA_DIR, "lat/model/lat_models_cltk/lemmata/backoff")
)



class CustomLatinBackoffLemmatizer:
    """Suggested backoff chain; includes at least one of each
    major sequential backoff class from backoff.py.
    """

    def __init__(
        self: object, train: List[list] = None, seed: int = 3, verbose: bool = False
    ):
        self.models_path = models_path

        missing_models_message = "LatinBackoffLemmatizer requires the ```latin_models_cltk``` to be in cltk_data. Please load this corpus."

        try:
            self.train = open_pickle(
                os.path.join(self.models_path, "latin_pos_lemmatized_sents.pickle")
            )
            self.LATIN_OLD_MODEL = open_pickle(
                os.path.join(self.models_path, "latin_lemmata_cltk.pickle")
            )
            self.LATIN_MODEL = open_pickle(
                os.path.join(self.models_path, "latin_model.pickle")
            )
        except FileNotFoundError as err:
            raise type(err)(missing_models_message)

        self.latin_sub_patterns = latin_sub_patterns  # Move to latin_models_cltk

        self.seed = seed
        self.VERBOSE = verbose

        def _randomize_data(train: List[list], seed: int):
            import random

            random.seed(seed)
            random.shuffle(train)
            train_size = int(0.9 * len(train))
            pos_train_sents = train[:train_size]
            lem_train_sents = [[(item[0], item[1]) for item in sent] for sent in train]
            train_sents = lem_train_sents[:train_size]
            test_sents = lem_train_sents[train_size:]

            return pos_train_sents, train_sents, test_sents

        self.pos_train_sents, self.train_sents, self.test_sents = _randomize_data(
            self.train, self.seed
        )
        self._define_lemmatizer()

    def _define_lemmatizer(self: object):
        # Suggested backoff chain--should be tested for optimal order
        self.backoff1 = DefaultLemmatizer(verbose=self.VERBOSE)
        self.backoff2 = DictLemmatizer(
            lemmas=self.LATIN_OLD_MODEL,
            source="Morpheus Lemmas",
            backoff=self.backoff1,
            verbose=self.VERBOSE,
        )
        self.backoff3 = RegexpLemmatizer(
            self.latin_sub_patterns,
            source="CLTK Latin Regex Patterns",
            backoff=self.backoff2,
            verbose=self.VERBOSE,
        )
        self.backoff4 = UnigramLemmatizer(
            self.train_sents,
            source="CLTK Sentence Training Data",
            backoff=self.backoff3,
            verbose=self.VERBOSE,
        )
        self.backoff5 = DictLemmatizer(
            lemmas=self.LATIN_MODEL,
            source="Latin Model",
            backoff=self.backoff4,
            verbose=self.VERBOSE,
        )
        self.lemmatizer = self.backoff5

    def lemmatize(self: object, tokens: List[str]):
        lemmas = self.lemmatizer.lemmatize(tokens)
        return lemmas

    def evaluate(self: object):
        if self.VERBOSE:
            raise AssertionError(
                "evaluate() method only works when verbose: bool = False"
            )
        return self.lemmatizer.evaluate(self.test_sents)

    def __repr__(self: object):
        return "<CustomLatinBackoffLemmatizer>"

    def __call__(self, token: str) -> str:
        return self.lemmatize([token])[0][0]

And you get:

lemmatizer = CustomLatinBackoffLemmatizer()
list(lemmatizer.lemmatize('arma virumque cano euhhhh'.split()))
[('arma', 'arma'),
 ('virumque', 'vir'),
 ('cano', 'cano'),
 ('euhhhh', None)]
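Since failures now carry None, they can be separated programmatically. A small illustration filtering the pairs shown above (plain Python, independent of CLTK):

```python
pairs = [("arma", "arma"), ("virumque", "vir"),
         ("cano", "cano"), ("euhhhh", None)]

lemmatized = [(tok, lem) for tok, lem in pairs if lem is not None]
failed = [tok for tok, lem in pairs if lem is None]
# failed == ['euhhhh']
```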

@langeslag
Author

That's immensely helpful @clemsciences , thanks a lot for putting in the extra effort!

I'll happily work with your custom class myself, but am I mistaken in thinking that having this functionality available in the standard LatinBackoffLemmatizer() class, through an option such as best_guess=False, would be an improvement? The only scenario in which I don't need it is when all I want to do is reduce dimensionality regardless of recall; in practice I keep running into scenarios where it's important to know that the returned values are in fact valid lemmas.

Either way, glad to see your solution.

@clemsciences
Member

clemsciences commented Dec 20, 2022

What do you think of the suggestion @diyclassics?

By the way, the suggested solution is not optimal: I could simply have made a child class of LatinBackoffLemmatizer that overrides _define_lemmatizer.

@clemsciences
Member

By the way, the suggested solution is not optimal: I could simply have made a child class of LatinBackoffLemmatizer that overrides _define_lemmatizer.

from cltk.lemmatize.backoff import (
    DefaultLemmatizer,
    DictLemmatizer,
    RegexpLemmatizer,
    UnigramLemmatizer,
)

from cltk.lemmatize.lat import LatinBackoffLemmatizer


class CustomLatinBackoffLemmatizer(LatinBackoffLemmatizer):
    
    def _define_lemmatizer(self: object):
        # Suggested backoff chain--should be tested for optimal order
        self.backoff1 = DefaultLemmatizer(verbose=self.VERBOSE)
        self.backoff2 = DictLemmatizer(
            lemmas=self.LATIN_OLD_MODEL,
            source="Morpheus Lemmas",
            backoff=self.backoff1,
            verbose=self.VERBOSE,
        )
        self.backoff3 = RegexpLemmatizer(
            self.latin_sub_patterns,
            source="CLTK Latin Regex Patterns",
            backoff=self.backoff2,
            verbose=self.VERBOSE,
        )
        self.backoff4 = UnigramLemmatizer(
            self.train_sents,
            source="CLTK Sentence Training Data",
            backoff=self.backoff3,
            verbose=self.VERBOSE,
        )
        self.backoff5 = DictLemmatizer(
            lemmas=self.LATIN_MODEL,
            source="Latin Model",
            backoff=self.backoff4,
            verbose=self.VERBOSE,
        )
        self.lemmatizer = self.backoff5

    def __repr__(self: object):
        return "<CustomLatinBackoffLemmatizer>"

>>> lemmatizer = CustomLatinBackoffLemmatizer()
>>> list(lemmatizer.lemmatize('arma virumque cano euhhhh'.split()))
[('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano'), ('euhhhh', None)]

@diyclassics
Collaborator

What do you think of the suggestion @diyclassics?

@clemsciences Thanks for providing the custom class—happy to consider whatever features are useful to the community. The BackoffLemmatizer was always meant to be configurable, so anything that makes reconfiguration easier is welcome.

I might at this point prefer something like a nesting pipeline... i.e. if we think of "lemmatizer" as a pipeline component, this lemmatizer could in turn be thought of as having a pipeline itself—i.e. the backoff chain—from which subcomponents—i.e. individual lemmatizers—could be added/deleted/rearranged/etc. But it would be some time before I could commit to that kind of refactoring.
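A rough sketch of what that nesting could look like: the backoff chain as an ordered list of steps that can be added, removed, or rearranged, and linked into an actual chain only on demand. All class and method names below are invented for illustration; this is not CLTK API.

```python
class BackoffPipeline:
    """Holds the chain as (name, factory) pairs, first-tried step first."""

    def __init__(self):
        self.steps = []

    def add(self, name, factory, position=None):
        entry = (name, factory)
        if position is None:
            self.steps.append(entry)
        else:
            self.steps.insert(position, entry)

    def remove(self, name):
        self.steps = [s for s in self.steps if s[0] != name]

    def build(self):
        # Link from the last resort outward: each factory receives the
        # already-built lemmatizer it should defer to on failure.
        lemmatizer = None
        for _, factory in reversed(self.steps):
            lemmatizer = factory(backoff=lemmatizer)
        return lemmatizer


class DictStep:
    """Minimal dict-based lemmatizer step for the demo."""

    def __init__(self, lemmas, backoff=None):
        self.lemmas = lemmas
        self.backoff = backoff

    def lemmatize(self, tokens):
        out = []
        for tok in tokens:
            if tok in self.lemmas:
                out.append((tok, self.lemmas[tok]))
            elif self.backoff is not None:
                out.append(self.backoff.lemmatize([tok])[0])
            else:
                out.append((tok, None))  # DefaultLemmatizer-style end
        return out


pipeline = BackoffPipeline()
pipeline.add("old_model", lambda backoff: DictStep({"virumque": "vir"}, backoff))
pipeline.add("model", lambda backoff: DictStep({"cano": "cano"}, backoff))
chain = pipeline.build()
# chain.lemmatize(["virumque", "cano", "euhhhh"])
# -> [('virumque', 'vir'), ('cano', 'cano'), ('euhhhh', None)]
```

Each step only knows the lemmatizer it defers to, so reordering the list and rebuilding is enough to reconfigure the whole chain.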
