New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue ArabicStemmer AttributeError #1852 #1856
Fix issue ArabicStemmer AttributeError #1852 #1856
Conversation
d488d61
to
468edb6
Compare
@LBenzahia would you please share the list of stopwords as a newline-delimited list of words, and I'll add it to the NLTK stopwords corpus, and you can access it there. |
6b58c8c
to
59dc98f
Compare
59dc98f
to
49d76a2
Compare
Indeed, I've added the stopwords list see this PR |
@LBenzahia: Thanks, I've added the Arabic stopwords to the NLTK corpus collection. |
With the new stopwords and the CI retest, the tests passed. @LBenzahia Could you help to take a look and is the PR set for a final review before merge? |
@alvations Thanks , I've tested it locally LGTM 👍, If there's any problem let me know to fix it. |
@@ -15,14 +15,16 @@ def test_arabic(self): | |||
this unit testing for test the snowball arabic light stemmer | |||
this stemmer deals with prefixes and suffixes | |||
""" | |||
ar_stemmer = SnowballStemmer("arabic") | |||
ar_stemmer = SnowballStemmer("arabic", True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add another test where the ignore_stopwords=False
.
if self.is_verb: | ||
modified_word = self.__Suffix_Verb_Step1(modified_word) | ||
if self.suffixes_verb_step1_success: | ||
modified_word = self.__Suffix_Verb_Step2a(modified_word) | ||
if not self.suffix_verb_step2a_success : | ||
modified_word = self.__Suffix_Verb_Step2c(modified_word) | ||
#or next | ||
#or next TODO: How to deal with or next instruction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which cases would there be more steps that needs to be applied here? Perhaps, it'll be good to list these cases down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're working on it and other todos when we solve them we'll send PR for updates
"or next" i mean this line from the original algorithm as you know i've rewrote the algorithm by hand to follow nltk guideline style code and avoid the generated code from snowball generator.
You can take a look at this list of issues and todos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a link to the assem-ch/arabicstemmer#1 in the github comment too? That'll be helpful for us to track later. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created an issue in nltk to track the changes later and added a comment in assem-ch/arabicstemmer#1 , I hope this is helpful, sorry for the late replay.
modified_word = self.__normalize_pre(modified_word) | ||
# Avoid stopwords | ||
if modified_word in self.stopwords or len(modified_word) <= 2: | ||
return modified_word |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -516,7 +516,7 @@ def __Suffix_Verb_Step1(self, token): | |||
|
|||
def __Suffix_Verb_Step2a(self, token): | |||
for suffix in self.__suffix_verb_step2a: | |||
if token.endswith(suffix): | |||
if token.endswith(suffix) and len(token) > 3: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curiosity, is there a linguistic reason to avoid words with 2 characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We didn't study the case of words that have 2 length yet, We're mentioned it in the list of our todos too.
@LBenzahia: I think we're waiting on more input from you before merging. |
@stevenbird, I've created an issue for that and linked it with our milestone for improving snowball ArabicStemmer in the original repo of the stemmer, Sorry for the late replay. |
[CI: retest] |
What's still in this pr to be merged? |
[CI: retest] |
@stevenbird @assem-ch I think it LGTM if no one else objects. |
Thanks @LBenzahia |
Hi @alvations, @richbalmer