Describe the bug
I've run into several incorrect lemmatizations (which, by extension, affect definition lookups) in the Latin sentences I've tried analyzing (could be just bad luck?).
To Reproduce
Steps to reproduce the behavior:
Install Python version 3.9
Install CLTK version 1.2.1
In a script or REPL, run the following code:
Example 1:
from cltk import NLP
cltk_nlp = NLP(language="lat")
input_text = 'quae res in civitate duae plurimum possunt'
doc = cltk_nlp.analyze(text=input_text)
print(doc.words[5].lemma) # should be "plurimus", is "multum"
Example 2:
from cltk import NLP
cltk_nlp = NLP(language="lat")
input_text = 'quotiens cumque verantem tuum serenissimum que vultum intueor'
doc = cltk_nlp.analyze(text=input_text)
print(doc.words[2].lemma) # should be "vero", is "veror"
Expected behavior
The lemmas shown in the code examples seem to be quite off. Example 2 looks like a minor regex issue, if I had to guess from the code (the word is bucketed as having an "or" ending when it should just be "o"), whereas Example 1 somehow appears to be lemmatized as a completely different word.
Desktop (please complete the following information):
OS and version: WSL with Ubuntu 20.04.1 LTS on Windows 10.0.19045
Additional context
The code is being used in a Flask web server so the code has been simplified, but these examples should identically represent what's happening assuming a default installation of said versions and downloading of default model files etc. Hopefully I haven't overlooked something silly! Overall though POS data etc. all seems to look okay, so shouldn't be anything catastrophically wrong with the setup?
Hello @coltonoscopy, sorry for the late response.
Example 1
Plurimus is the superlative form of multus, and a lemma is the basic form of a word. In Latin, the lemma of an adjective is its masculine singular positive-grade form. Even though an entry for plurimus may be found in a Latin dictionary, it is still a derived (in this case, suppletive) form of multus.
Example 2
Verantem is the present participle of the verb vero. In my run, I get verans as the lemma. This is surprising, because the lemma of a present participle should be the first-person singular present indicative form of the verb.
The results depend on how lemmas were defined in the training set. How can we fix that? Maybe with some rule-based dictionary?
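As a rough illustration of the rule-based dictionary idea, here is a minimal sketch of a post-processing step that overrides the model's lemma for specific surface forms. The names `LEMMA_OVERRIDES` and `apply_lemma_overrides` are hypothetical and not part of CLTK, the table only covers the two forms from this issue, and the model lemmas in the sample input are copied or assumed for illustration. Which lemma counts as "correct" is a convention choice (for instance plurimus versus multus), so the table contents would depend on the dictionary being looked up.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical override table: lowercase surface form -> preferred lemma.
# These two entries reflect the lemmas the issue reporter expected.
LEMMA_OVERRIDES: Dict[str, str] = {
    "plurimum": "plurimus",
    "verantem": "vero",
}


def apply_lemma_overrides(
    pairs: Iterable[Tuple[str, str]],
    overrides: Dict[str, str] = LEMMA_OVERRIDES,
) -> List[Tuple[str, str]]:
    """Return (token, lemma) pairs, replacing the lemma when the token is listed.

    Tokens not present in the table keep the lemma produced by the model.
    """
    return [
        (token, overrides.get(token.lower(), lemma))
        for token, lemma in pairs
    ]


# Sample (token, model lemma) pairs; the lemmas here are illustrative.
model_output = [
    ("plurimum", "multum"),
    ("possunt", "possum"),
    ("Verantem", "veror"),
]
corrected = apply_lemma_overrides(model_output)
```

With CLTK, the input pairs could presumably be built from the analyzed document, for example from each item of `doc.words` using its `lemma` attribute as in the examples above; the attribute holding the surface token is assumed here to be `string`. A lookup table like this only patches known cases and does not generalize to unseen forms.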
I don't see a better solution for now. Please reopen the issue if you want to share your ideas.