ArabicStemmer AttributeError #1852

richbalmer · 2017-10-11T11:04:43Z

I'm unable to stem certain Arabic terms using the SnowballStemmer. Many terms are stemmed successfully, but some cause an AttributeError to be raised. Please see below for a minimal example that fails on the term 'من' ('from').

(anaconda2-4.4.0) richard-balmer-macbook:~ richardbalmer$ pip freeze | grep nltk
nltk==3.2.5
(anaconda2-4.4.0) richard-balmer-macbook:~ richardbalmer$ ipython
Python 2.7.13 |Anaconda custom (x86_64)| (default, Dec 20 2016, 23:05:08)
Type "copyright", "credits" or "license" for more information.

IPython 5.3.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from nltk.stem.snowball import SnowballStemmer

In [2]: stemmer = SnowballStemmer('arabic')

In [3]: stemmer.stem(u'تسدد')
Out[3]: u'\u062a\u0633\u062f\u062f'

In [4]: stemmer.stem(u'من')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-ffa733106049> in <module>()
----> 1 stemmer.stem(u'من')

/Users/richardbalmer/.pyenv/versions/anaconda2-4.4.0/lib/python2.7/site-packages/nltk/stem/snowball.pyc in stem(self, word)
    762                 modified_word = self.__Suffix_Verb_Step2b(modified_word)
    763                 if not self.suffix_verb_step2b_success:
--> 764                     modified_word = self.__Suffix_Verb_Step2a(modified_word)
    765         if self.is_noun:
    766             modified_word = self.__Suffix_Noun_Step2c2(modified_word)

/Users/richardbalmer/.pyenv/versions/anaconda2-4.4.0/lib/python2.7/site-packages/nltk/stem/snowball.pyc in __Suffix_Verb_Step2a(self, token)
    533                     break
    534
--> 535                 if suffix in self.__conjugation_suffix_verb_present and len(token) > 5:
    536                     token = token[:-2]  # present
    537                     self.suffix_verb_step2a_success = True

AttributeError: 'ArabicStemmer' object has no attribute '_ArabicStemmer__conjugation_suffix_verb_present'
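
The attribute name in the message, `_ArabicStemmer__conjugation_suffix_verb_present`, is the name-mangled form of `__conjugation_suffix_verb_present`. One way such an error can arise (an assumption about the cause, not something confirmed in this thread) is a double-underscore attribute that is referenced inside a class but never defined on it. A minimal sketch with a hypothetical `Demo` class, not NLTK's code:

```python
class Demo(object):
    """Hypothetical class for illustration; not taken from NLTK."""

    # Inside the class body this is stored as _Demo__suffixes_past.
    __suffixes_past = (u'ed',)

    def check(self, suffix):
        # self.__suffixes_present is rewritten to self._Demo__suffixes_present,
        # which was never defined, so the lookup raises AttributeError.
        return suffix in self.__suffixes_present


try:
    Demo().check(u'ed')
except AttributeError as err:
    message = str(err)
    print(message)
```

The printed message names the mangled attribute, which matches the shape of the error in the traceback above.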

alvations · 2017-10-13T08:34:01Z

@richbalmer Thanks for reporting the issue.

@LBenzahia Could you help to look into this? Thanks in advance!

greenat92 · 2017-10-13T13:41:39Z

Hi @richbalmer, thank you for reporting this. For the first word, 'تسدد' is already the best possible stem: the Snowball Arabic stemmer is based on a light stemming algorithm that only deals with prefixes and suffixes. If you are looking for the root of 'تسدد', you can use ISRI (a root-based, deep stemmer). The second word, 'من', is a stop word, so you should apply a stop word filter before using the Snowball ArabicStemmer. Also, this stemmer doesn't handle words that have only 2 letters.
Anyway, I've fixed the problem in PR #1856.
Thank you again!
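
For anyone on a release without the fix, the advice above (filter stop words and skip very short words before stemming) could be sketched roughly as follows. The helper `safe_stem`, the `min_length=3` threshold, and the `toy_stem` stand-in are all hypothetical names chosen for illustration; they are not part of NLTK or of the PR.

```python
def safe_stem(stem, word, stopwords=frozenset(), min_length=3):
    """Return word unchanged if it is a stop word or too short, else stem(word).

    stem is any callable that maps a word to its stem.
    """
    if word in stopwords or len(word) < min_length:
        return word
    return stem(word)


def toy_stem(word):
    # Stand-in stemmer that drops the last character, used only so the
    # sketch runs without NLTK installed.
    return word[:-1]


STOPWORDS = frozenset([u'من'])  # illustrative single-entry stop word set

print(safe_stem(toy_stem, u'من', STOPWORDS))     # stop word, returned as-is
print(safe_stem(toy_stem, u'stems', STOPWORDS))  # passed through to toy_stem
```

With NLTK installed, `SnowballStemmer('arabic').stem` would be passed in place of `toy_stem`.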

richbalmer · 2017-10-13T16:46:51Z

@LBenzahia thanks for looking into this so quickly! I'm getting:

  File "/Users/richardbalmer/src/nltk/nltk/stem/util.py", line 24
    arabic_stopwords = ['إذ',
                             ^
SyntaxError: Non-ASCII character '\xd8' in file /Users/richardbalmer/src/nltk/nltk/stem/util.py on line 24, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Which also appears to be causing the tests to fail on Jenkins (https://nltk.ci.cloudbees.com/job/pull_request_tests/454/TOXENV=py27-jenkins,jdk=jdk8latestOnlineInstall/testReport/nose.failure/Failure/runTest/). I think all you need to do is put # -*- coding: utf-8 -*- at the top of stem/util.py.

Also, after fixing that locally I get a UnicodeWarning:

/Users/richardbalmer/src/nltk/nltk/stem/snowball.py:748: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if word in arabic_stopwords:

It might be worth making those stopwords unicode strings.

Other than that it looks like your fix works nicely for me - thanks again!

p.s. One other suggestion: testing set inclusion is quite a lot faster than list inclusion, so it might be worth making that stopword list a set instead.
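
The three suggestions above (encoding declaration, unicode literals, and a set instead of a list) can be combined in one rough sketch. The name `ARABIC_STOPWORDS` and the two entries are illustrative assumptions, not the actual contents of stem/util.py in the PR.

```python
# -*- coding: utf-8 -*-
# PEP 263: on Python 2 the declaration above must appear on line 1 or 2 of
# the file, otherwise non-ASCII bytes in the source raise a SyntaxError.

# u'' literals keep the entries as unicode on Python 2, so comparing them
# with unicode input does not trigger a UnicodeWarning. A frozenset gives
# average constant-time membership tests instead of a linear scan of a list.
ARABIC_STOPWORDS = frozenset([
    u'إذ',
    u'من',
])


def is_stopword(word):
    """Return True if word is in the illustrative stop word set."""
    return word in ARABIC_STOPWORDS


print(is_stopword(u'من'))
print(is_stopword(u'تسدد'))
```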

greenat92 · 2017-10-13T16:56:15Z

@richbalmer Are you using Python 2.7?

"It might be worth making those stopwords unicode strings."

Done for Python 2.7. I've updated the PR; please test it again and let me know. It works fine for me.

richbalmer · 2017-10-16T11:07:56Z

Yup I'm using 2.7. Looking good @LBenzahia - thanks again!

NouraAls · 2018-02-18T11:36:33Z

I'm still getting the error:
AttributeError: 'ArabicStemmer' object has no attribute '_ArabicStemmer__conjugation_suffix_verb_present'

I'm using Python 3.

greenat92 · 2018-02-18T12:29:48Z

@NouraAls This is solved in PR #1856 (Fix issue ArabicStemmer AttributeError #1852).

alvations added the bug, pleaseverify, and tests labels Oct 13, 2017

greenat92 added a commit to greenat92/nltk that referenced this issue Oct 13, 2017

add arabic stopwords list / fix issue ArabicStemmer AttributeError nl…

2414fbc

…tk#1852

greenat92 mentioned this issue Oct 13, 2017

Fix issue ArabicStemmer AttributeError #1852 #1856

Merged

greenat92 added a commit to greenat92/nltk that referenced this issue Oct 13, 2017

add arabic stopwords list / fix issue ArabicStemmer AttributeError nl…

d488d61

…tk#1852

greenat92 added a commit to greenat92/nltk that referenced this issue Oct 13, 2017

add arabic stopwords list / fix issue ArabicStemmer AttributeError nl…

468edb6

…tk#1852

alvations added this to the 3.2.6 milestone Oct 14, 2017

greenat92 added a commit to greenat92/nltk that referenced this issue Oct 16, 2017

add arabic stopwords list / fix issue ArabicStemmer AttributeError nl…

6b58c8c

…tk#1852

greenat92 added a commit to greenat92/nltk that referenced this issue Oct 16, 2017

add arabic stopwords list / fix issue ArabicStemmer AttributeError nl…

59dc98f

…tk#1852

greenat92 added a commit to greenat92/nltk that referenced this issue Oct 16, 2017

add arabic stopwords list / fix issue ArabicStemmer AttributeError nl…

49d76a2

…tk#1852

alvations added the resolved label Feb 15, 2018

stevenbird closed this as completed Apr 1, 2018

stevenbird added a commit that referenced this issue Oct 21, 2018

Merge pull request #1856 from LBenzahia/fix/arabicstemmer-attributeError

e84c526

Fix issue ArabicStemmer AttributeError #1852
