Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue ArabicStemmer AttributeError #1852 #1856

Merged
merged 2 commits into from Oct 21, 2018

Conversation

greenat92
Copy link
Contributor

Hi @alvations, @richbalmer

  1. Add arabic stop word list.
  2. Fix issue ArabicStemmer AttributeError ArabicStemmer AttributeError #1852

@greenat92 greenat92 force-pushed the fix/arabicstemmer-attributeError branch 2 times, most recently from d488d61 to 468edb6 Compare October 13, 2017 16:59
@stevenbird
Copy link
Member

@LBenzahia would you please share the list of stopwords as a newline-delimited list of words, and I'll add it to the NLTK stopwords corpus, and you can access it there.

@stevenbird stevenbird self-assigned this Oct 14, 2017
@greenat92 greenat92 force-pushed the fix/arabicstemmer-attributeError branch 2 times, most recently from 6b58c8c to 59dc98f Compare October 16, 2017 11:16
@greenat92 greenat92 force-pushed the fix/arabicstemmer-attributeError branch from 59dc98f to 49d76a2 Compare October 16, 2017 11:17
@greenat92
Copy link
Contributor Author

greenat92 commented Oct 16, 2017

@stevenbird,

would you please share the list of stopwords as a newline-delimited list of words

Indeed, I've added the stopwords list see this PR

@alvations alvations added this to the 3.2.6 milestone Oct 16, 2017
@stevenbird
Copy link
Member

@LBenzahia: Thanks, I've added the Arabic stopwords to the NLTK corpus collection.

@alvations
Copy link
Contributor

With the new stopwords and the CI retest, the tests passed.

@LBenzahia Could you help to take a look and is the PR set for a final review before merge?

@greenat92
Copy link
Contributor Author

@alvations Thanks , I've tested it locally LGTM 👍, If there's any problem let me know to fix it.

@@ -15,14 +15,16 @@ def test_arabic(self):
this unit testing for test the snowball arabic light stemmer
this stemmer deals with prefixes and suffixes
"""
ar_stemmer = SnowballStemmer("arabic")
ar_stemmer = SnowballStemmer("arabic", True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add another test where the ignore_stopwords=False.

if self.is_verb:
modified_word = self.__Suffix_Verb_Step1(modified_word)
if self.suffixes_verb_step1_success:
modified_word = self.__Suffix_Verb_Step2a(modified_word)
if not self.suffix_verb_step2a_success :
modified_word = self.__Suffix_Verb_Step2c(modified_word)
#or next
#or next TODO: How to deal with or next instruction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which cases would there be more steps that needs to be applied here? Perhaps, it'll be good to list these cases down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're working on it and other todos when we solve them we'll send PR for updates
"or next" i mean this line from the original algorithm as you know i've rewrote the algorithm by hand to follow nltk guideline style code and avoid the generated code from snowball generator.
You can take a look at this list of issues and todos.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a link to the assem-ch/arabicstemmer#1 in the github comment too? That'll be helpful for us to track later. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created an issue in nltk to track the changes later and added a comment in assem-ch/arabicstemmer#1 , I hope this is helpful, sorry for the late replay.

modified_word = self.__normalize_pre(modified_word)
# Avoid stopwords
if modified_word in self.stopwords or len(modified_word) <= 2:
return modified_word
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -516,7 +516,7 @@ def __Suffix_Verb_Step1(self, token):

def __Suffix_Verb_Step2a(self, token):
for suffix in self.__suffix_verb_step2a:
if token.endswith(suffix):
if token.endswith(suffix) and len(token) > 3:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity, is there a linguistic reason to avoid words with 2 characters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We didn't study the case of words that have 2 length yet, We're mentioned it in the list of our todos too.

@stevenbird
Copy link
Member

@LBenzahia: I think we're waiting on more input from you before merging.

@greenat92
Copy link
Contributor Author

greenat92 commented Dec 20, 2017

@stevenbird, I've created an issue for that and linked it with our milestone for improving snowball ArabicStemmer in the original repo of the stemmer, Sorry for the late replay.
Thank you!

@alvations
Copy link
Contributor

[CI: retest]

@assem-ch
Copy link

assem-ch commented Mar 5, 2018

What's still in this pr to be merged?

@alvations
Copy link
Contributor

[CI: retest]

@alvations
Copy link
Contributor

@stevenbird @assem-ch I think it LGTM if no one else objects.

@alvations alvations removed this from the 3.2.6 milestone Aug 28, 2018
@stevenbird stevenbird merged commit e84c526 into nltk:develop Oct 21, 2018
@stevenbird
Copy link
Member

Thanks @LBenzahia

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants