While playing around with some sample code, I found that the Lemmatizer gives some odd results. The most reproducible case (not requiring crafted input) is the English word us, which consistently lemmatizes to u.
Is this intentional, and if so, why? [I'm curious.]
Sample Code
```python
import fileinput

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

lemmatizer = WordNetLemmatizer()

# take input from stdin
for inputLine in fileinput.input():
    # split sentences
    sents = sent_tokenize(inputLine)
    for sent in sents:
        # split words
        words = word_tokenize(sent)
        for word in words:
            # lemmatize each word
            print(lemmatizer.lemmatize(word))
        # end sentence
        print("")
```
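Running the script over stdin reproduces the report (assuming it is saved as lemmatize.py; the file name is only for illustration, and the trailing blank line comes from the end-of-sentence print):

```
$ echo "us" | python lemmatize.py
u

```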
This is not necessarily intentional, but it is indicative of the simple nature of the NLTK lemmatizer. WordNetLemmatizer (as the name suggests) uses WordNet, and in particular the WordNet morphy function. This function (originally specified here) is a very simple suffix-replacement system. For example, if a noun ends with xes, then morphy checks whether that word, with xes replaced by x, exists in WordNet. If it does, that word becomes one potential lemma. (Note: morphy also consults an exception list to handle common irregular forms.)
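To see the rules in action, here is a minimal sketch; wn.morphy and wn.NOUN are real NLTK APIs, and the input words are just illustrative:

```python
from nltk.corpus import wordnet as wn

# "boxes" ends in "xes"; morphy tries replacing "xes" with "x",
# finds the noun "box" in WordNet, and returns it as a lemma.
print(wn.morphy('boxes', wn.NOUN))   # box

# The noun rules also include plain "s" -> "", so regular plurals
# such as "houses" resolve to their singular form.
print(wn.morphy('houses', wn.NOUN))  # house
```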
However, this is far from a "smart" system: it never verifies that the "potential lemma" is actually related to the original word at all. This causes problems in cases like #2567. They cannot be fixed within the current lemmatizer; avoiding them would require writing a new lemmatizer altogether.
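The reported us case is exactly this failure mode: the noun rule "s" → "" turns us into u, and because u exists in WordNet as a noun (e.g. the letter U), morphy accepts it even though it is unrelated to the pronoun. A quick check (the exact synsets listed depend on your installed WordNet data):

```python
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

print(WordNetLemmatizer().lemmatize('us'))  # u

# The "lemma" morphy found is a genuine WordNet noun, just not one
# related to the pronoun "us":
for syn in wn.synsets('u', wn.NOUN):
    print(syn.name(), '-', syn.definition())
```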
People are always free to donate their time to create a new one for NLTK, but at this stage nobody is specifically working on a new NLTK lemmatizer. For now I'll close this; I hope this answered your question.