Fix issue ArabicStemmer AttributeError #1852 #1856

greenat92 · 2017-10-13T13:40:11Z

Add Arabic stopword list.
Fix issue ArabicStemmer AttributeError #1852

stevenbird · 2017-10-14T22:13:09Z

@LBenzahia would you please share the list of stopwords as a newline-delimited list of words, and I'll add it to the NLTK stopwords corpus, and you can access it there.

greenat92 · 2017-10-16T11:19:38Z

@stevenbird,

would you please share the list of stopwords as a newline-delimited list of words

Indeed, I've added the stopwords list; see this PR.

stevenbird · 2017-10-17T23:33:57Z

@LBenzahia: Thanks, I've added the Arabic stopwords to the NLTK corpus collection.

alvations · 2017-11-17T07:20:26Z

With the new stopwords and the CI retest, the tests passed.

@LBenzahia Could you take a look and confirm whether the PR is ready for a final review before merging?

greenat92 · 2017-11-17T08:57:59Z

@alvations Thanks, I've tested it locally and it LGTM 👍. If there's any problem, let me know and I'll fix it.

alvations · 2017-11-17T10:00:19Z

nltk/test/unit/test_stem.py

@@ -15,14 +15,16 @@ def test_arabic(self):
        this unit testing for test the snowball arabic light stemmer
        this stemmer deals with prefixes and suffixes
        """
-        ar_stemmer = SnowballStemmer("arabic")
+        ar_stemmer = SnowballStemmer("arabic", True)


Please add another test where ignore_stopwords=False.
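
A minimal sketch of what such a pair of tests could look like. To stay runnable without NLTK or the stopwords corpus, it uses a hypothetical stand-in class `ToyStemmer` with a made-up English stopword set and a toy suffix rule; only the `ignore_stopwords` flag mirrors the real `SnowballStemmer` parameter. The actual tests would presumably build `SnowballStemmer("arabic", ignore_stopwords=...)` and assert on Arabic words.

```python
import unittest


class ToyStemmer:
    """Hypothetical stand-in for a Snowball-style stemmer (not NLTK code)."""

    STOPWORDS = frozenset({"having", "being"})  # made-up list for the demo

    def __init__(self, ignore_stopwords=False):
        # Same convention as SnowballStemmer: when True, stopwords are
        # returned unchanged instead of being stemmed.
        self.stopwords = self.STOPWORDS if ignore_stopwords else frozenset()

    def stem(self, word):
        if word in self.stopwords:
            return word
        # Toy rule: strip a trailing "ing".
        return word[:-3] if word.endswith("ing") else word


class ToyStemmerTest(unittest.TestCase):
    def test_ignore_stopwords_true(self):
        stemmer = ToyStemmer(ignore_stopwords=True)
        self.assertEqual(stemmer.stem("having"), "having")  # left untouched
        self.assertEqual(stemmer.stem("walking"), "walk")

    def test_ignore_stopwords_false(self):
        stemmer = ToyStemmer(ignore_stopwords=False)
        self.assertEqual(stemmer.stem("having"), "hav")  # stemmed like any word
        self.assertEqual(stemmer.stem("walking"), "walk")
```

Covering both flag values in separate test methods makes it obvious which configuration regressed if one of them fails.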

alvations · 2017-11-17T10:01:29Z

nltk/stem/snowball.py

        if self.is_verb:
            modified_word = self.__Suffix_Verb_Step1(modified_word)
            if  self.suffixes_verb_step1_success:
                modified_word = self.__Suffix_Verb_Step2a(modified_word)
                if not self.suffix_verb_step2a_success :
                    modified_word = self.__Suffix_Verb_Step2c(modified_word)
-                #or next
+                #or next TODO: How to deal with or next instruction


In which cases would more steps need to be applied here? It might be good to list those cases.

We're working on it and the other TODOs; once we solve them, we'll send a PR with the updates.
By "or next" I mean this line from the original algorithm. As you know, I rewrote the algorithm by hand to follow the NLTK code style guidelines and avoid the code generated by the Snowball generator.
You can take a look at this list of issues and TODOs.

Could you add a link to assem-ch/arabicstemmer#1 in the GitHub comment too? That will help us track it later. Thanks!

I've created an issue in NLTK to track the changes later and added a comment in assem-ch/arabicstemmer#1. I hope this is helpful; sorry for the late reply.
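
For readers unfamiliar with Snowball: in that language, `A or B` runs `A` and, only if `A` fails, restores the cursor and runs `B`, while `next` moves the cursor by one character without altering the word. The sketch below imitates that ordered choice in plain Python. It is an illustration under assumptions, not the stemmer's actual code: `first_success`, `drop_suffix`, `keep_word`, and the Latin suffixes are all hypothetical, and `next` is approximated by a fallback that succeeds while leaving the word as it is.

```python
def drop_suffix(suffix):
    """Build a step that strips `suffix`; the step reports success or failure."""
    def step(word):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], True
        return word, False
    return step


def keep_word(word):
    """Fallback that always succeeds and changes nothing (stand-in for `next`)."""
    return word, True


def first_success(word, *steps):
    """Try each step on the original word, in order; stop at the first success."""
    for step in steps:
        result, ok = step(word)
        if ok:
            return result, True
    return word, False


chain = (drop_suffix("ing"), drop_suffix("ed"), keep_word)
print(first_success("jumped", *chain))  # second alternative applies
print(first_success("jump", *chain))    # only the fallback succeeds
```

Under this reading, an "or next" at the end of a chain would mainly guarantee that the whole expression succeeds even when no suffix step matches.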

alvations · 2017-11-17T10:02:14Z

nltk/stem/snowball.py

        modified_word = self.__normalize_pre(modified_word)
+        # Avoid stopwords
+        if modified_word in self.stopwords or len(modified_word) <= 2:
+            return modified_word
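
The early return in this hunk can be sketched in isolation as follows. This is a hedged illustration, not the NLTK implementation: `stem_with_guard`, `stem_fn`, and the demo inputs are hypothetical, and only the condition itself (stopword membership or a length of at most 2) is taken from the diff.

```python
def stem_with_guard(word, stopwords, stem_fn):
    """Return `word` unchanged if it is a stopword or has at most 2 characters.

    Hypothetical helper: `stem_fn` stands in for the real prefix/suffix steps.
    """
    if word in stopwords or len(word) <= 2:
        return word
    return stem_fn(word)


def strip_s(word):
    """Toy stemming step used only for the demo."""
    return word[:-1] if word.endswith("s") else word


# Made-up inputs, chosen only to exercise each branch.
demo_stopwords = frozenset({"the", "and"})

print(stem_with_guard("the", demo_stopwords, strip_s))   # stopword: unchanged
print(stem_with_guard("as", demo_stopwords, strip_s))    # two characters: unchanged
print(stem_with_guard("cats", demo_stopwords, strip_s))  # passed on to stem_fn
```

With an empty stopword collection, only the length check remains active.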


alvations · 2017-11-17T10:02:54Z

nltk/stem/snowball.py

@@ -516,7 +516,7 @@ def __Suffix_Verb_Step1(self, token):

    def __Suffix_Verb_Step2a(self, token):
        for suffix in self.__suffix_verb_step2a:
-            if token.endswith(suffix):
+            if token.endswith(suffix) and len(token) > 3:


Just out of curiosity, is there a linguistic reason to avoid words with 2 characters?

We haven't studied the case of two-character words yet. We've mentioned it in our list of TODOs too.
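
To make the effect of the length guard in the `__Suffix_Verb_Step2a` hunk concrete, here is a small sketch. It assumes nothing beyond the condition shown in the diff (`token.endswith(suffix) and len(token) > 3`); the function name `strip_first_suffix`, the success flag in the return value, and the Latin suffixes are hypothetical.

```python
def strip_first_suffix(token, suffixes, min_token_len=3):
    """Strip the first matching suffix from tokens longer than `min_token_len`.

    Returns the (possibly shortened) token and a flag telling whether a
    suffix was removed, similar in spirit to the stemmer's success flags.
    """
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) > min_token_len:
            return token[: -len(suffix)], True
    return token, False


demo_suffixes = ("ed", "s")  # made-up suffixes for illustration

print(strip_first_suffix("walked", demo_suffixes))  # long enough: suffix removed
print(strip_first_suffix("bed", demo_suffixes))     # 3 characters: left alone
print(strip_first_suffix("is", demo_suffixes))      # 2 characters: left alone
```

The guard keeps very short tokens from being reduced to one or two characters, which is the situation the question above asks about.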

stevenbird · 2017-12-20T10:47:41Z

@LBenzahia: I think we're waiting on more input from you before merging.

greenat92 · 2017-12-20T11:16:20Z

@stevenbird, I've created an issue for that and linked it to our milestone for improving the Snowball ArabicStemmer in the stemmer's original repo. Sorry for the late reply.
Thank you!

alvations · 2018-02-15T03:42:18Z

[CI: retest]

assem-ch · 2018-03-05T20:50:13Z

What's still needed for this PR to be merged?

alvations · 2018-07-26T01:01:50Z

[CI: retest]

alvations · 2018-07-26T01:02:08Z

@stevenbird @assem-ch I think it LGTM if no one else objects.

stevenbird · 2018-10-21T10:19:21Z

Thanks @LBenzahia

greenat92 mentioned this pull request Oct 13, 2017

ArabicStemmer AttributeError #1852

Closed

greenat92 force-pushed the fix/arabicstemmer-attributeError branch 2 times, most recently from d488d61 to 468edb6 on October 13, 2017 16:59

stevenbird self-assigned this Oct 14, 2017

greenat92 force-pushed the fix/arabicstemmer-attributeError branch 2 times, most recently from 6b58c8c to 59dc98f on October 16, 2017 11:16

add arabic stopwords list / fix issue ArabicStemmer AttributeError nltk#1852

49d76a2

greenat92 force-pushed the fix/arabicstemmer-attributeError branch from 59dc98f to 49d76a2 on October 16, 2017 11:17

alvations added this to the 3.2.6 milestone Oct 16, 2017

alvations reviewed Nov 17, 2017

add unit tests where we don't ignore stopwords

c818bb5

alvations added stem/lemma corpus labels Nov 23, 2017

alvations approved these changes Feb 15, 2018

alvations removed this from the 3.2.6 milestone Aug 28, 2018

stevenbird merged commit e84c526 into nltk:develop Oct 21, 2018
