
[question] "us" lemmatizes into "u"? #2930

Closed
DamonBlais opened this issue Jan 19, 2022 · 1 comment

Comments

@DamonBlais

Confusion

While playing around with some sample code, I found that the Lemmatizer produces some odd results. The most reproducible one (not requiring crafted input) is the English word "us", which consistently lemmatizes to "u".

Is this intentional, and if so, why? [I'm curious.]

Sample Code

import fileinput

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

lemmatizer = WordNetLemmatizer()

# take input from stdin
for inputLine in fileinput.input():
    # split sentences
    sents = sent_tokenize(inputLine)
    for sent in sents:
        # split words
        words = word_tokenize(sent)
        for word in words:
            # lemmatize each word
            print(lemmatizer.lemmatize(word))
        # end sentence
        print("")
@tomaarsen
Member

This is not necessarily intentional, but it's indicative of the simple nature of the NLTK Lemmatizer. WordNetLemmatizer (as the name suggests) uses WordNet, and in particular the WordNet morphy function. This function (originally specified here) is a very simple suffix-replacement system. For example, if a noun ends with xes, then morphy will check whether that word, with xes replaced by x, exists in WordNet. If it does, then that will be one potential lemma. (Note: it also uses an exception list to avoid some common issues.)
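
To make this concrete, here is a minimal sketch of the "us" case (assuming the standard wordnet corpus has been downloaded, e.g. via nltk.download('wordnet')):

from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# with no POS argument, lemmatize() treats the word as a noun
print(lemmatizer.lemmatize("us"))    # -> 'u'

# morphy's noun rules include stripping a trailing "s"; the stripped
# candidate "u" is accepted because "u" (the name of the letter) is
# itself a noun in WordNet, even though it is unrelated to "us"
print(wn.synsets("u", pos=wn.NOUN))  # -> non-empty list of synsets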

However, this is far from a "smart" system, as it does not ensure that the "potential lemma" is actually related to the original word at all. This causes issues in some cases, like #2567. These cannot simply be fixed with the current lemmatizer; if we want to avoid them, we would need to create a new lemmatizer altogether.

People are always free to donate their time to create a new one for NLTK, but at this stage nobody is specifically working on a new NLTK lemmatizer. For now I'll close this; I hope this answered your question.
