
[question] "us" lemmatizes into "u"? #2930

Closed
DamonBlais opened this issue Jan 19, 2022 · 1 comment

Comments

@DamonBlais

Confusion

While playing around with some sample code, I found that the Lemmatizer produces some odd results. The most reproducible one (not requiring crafted input) is the English word "us", which consistently lemmatizes to "u".

Is this intentional, and if so, why? [I'm curious.]

Sample Code

import fileinput

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

lemmatizer = WordNetLemmatizer()

# take input from stdin
for inputLine in fileinput.input():
    # split sentences
    sents = sent_tokenize(inputLine)
    for sent in sents:
        # split words
        words = word_tokenize(sent)
        for word in words:
            # lemmatize each word
            print(lemmatizer.lemmatize(word))
        # end sentence
        print("")
@tomaarsen
Member

This is not necessarily intentional, but it's indicative of the simple nature of the NLTK Lemmatizer. WordNetLemmatizer (as the name suggests) uses WordNet, and in particular the WordNet morphy function. This function (originally specified here) is a very simple suffix-replacement system. For example, if a noun ends with xes, then morphy will check whether that word, with xes replaced by x, exists in WordNet. If it does, then that will be one potential lemma. (Note: it also uses an exception list to avoid some common issues.)
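
To make this concrete, here is a minimal sketch of the "us" case (assuming the standard wordnet corpus has been downloaded, e.g. via nltk.download('wordnet')):

from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# with no POS argument, lemmatize() treats the word as a noun
print(lemmatizer.lemmatize("us"))    # -> 'u'

# morphy's noun rules include stripping a trailing "s"; the stripped
# candidate "u" is accepted because "u" (the name of the letter) is
# itself a noun in WordNet, even though it is unrelated to "us"
print(wn.synsets("u", pos=wn.NOUN))  # -> non-empty list of synsets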

However, this is far from a "smart" system, as it does not ensure that the "potential lemma" is actually related to the original word at all. This causes issues in some cases, like #2567. These cannot simply be fixed with the current lemmatizer; if we want to avoid them, we would need to create a new lemmatizer altogether.

People are always free to donate their time to create a new one for NLTK, but at this stage nobody is specifically working on a new NLTK lemmatizer. For now I'll close this; I hope this answered your question.
