ArabicStemmer AttributeError #1852

Closed
richbalmer opened this issue Oct 11, 2017 · 7 comments

@richbalmer

I'm failing to stem certain Arabic terms using the SnowballStemmer. Many terms are stemmed successfully, but some cause an AttributeError to be raised. Please see below for a minimal example that fails on the term 'من' ('from').

(anaconda2-4.4.0) richard-balmer-macbook:~ richardbalmer$ pip freeze | grep nltk
nltk==3.2.5
(anaconda2-4.4.0) richard-balmer-macbook:~ richardbalmer$ ipython
Python 2.7.13 |Anaconda custom (x86_64)| (default, Dec 20 2016, 23:05:08)
Type "copyright", "credits" or "license" for more information.

IPython 5.3.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from nltk.stem.snowball import SnowballStemmer

In [2]: stemmer = SnowballStemmer('arabic')

In [3]: stemmer.stem(u'تسدد')
Out[3]: u'\u062a\u0633\u062f\u062f'

In [4]: stemmer.stem(u'من')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-ffa733106049> in <module>()
----> 1 stemmer.stem(u'من')

/Users/richardbalmer/.pyenv/versions/anaconda2-4.4.0/lib/python2.7/site-packages/nltk/stem/snowball.pyc in stem(self, word)
    762                 modified_word = self.__Suffix_Verb_Step2b(modified_word)
    763                 if not self.suffix_verb_step2b_success:
--> 764                     modified_word = self.__Suffix_Verb_Step2a(modified_word)
    765         if self.is_noun:
    766             modified_word = self.__Suffix_Noun_Step2c2(modified_word)

/Users/richardbalmer/.pyenv/versions/anaconda2-4.4.0/lib/python2.7/site-packages/nltk/stem/snowball.pyc in __Suffix_Verb_Step2a(self, token)
    533                     break
    534
--> 535                 if suffix in self.__conjugation_suffix_verb_present and len(token) > 5:
    536                     token = token[:-2]  # present
    537                     self.suffix_verb_step2a_success = True

AttributeError: 'ArabicStemmer' object has no attribute '_ArabicStemmer__conjugation_suffix_verb_present'
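
The attribute name in the error message carries an `_ArabicStemmer` prefix because Python mangles any `self.__name` reference written inside a class body into `self._ClassName__name`. The sketch below reproduces the same shape of error with a hypothetical `Demo` class (not NLTK code); it only shows the naming rule and would be consistent with the attribute never having been assigned under that exact name, although the thread itself does not state the root cause.

```python
class Demo(object):
    # Hypothetical class; only the naming rule matches the traceback above.
    def __init__(self):
        # Stored on the instance as _Demo__defined because of name mangling.
        self.__defined = (u'a',)

    def lookup_missing(self):
        # Compiled as self._Demo__missing, which was never assigned.
        return self.__missing


try:
    Demo().lookup_missing()
except AttributeError as exc:
    message = str(exc)
    print(message)
```

Searching the class for an assignment to the unmangled name (here `__missing`) is usually the quickest way to check whether the attribute was misspelled or omitted.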
@alvations
Contributor

@richbalmer Thanks for reporting the issue.

@LBenzahia Could you help to look into this? Thanks in advance!

@greenat92
Contributor

greenat92 commented Oct 13, 2017

Hi @richbalmer, thank you for reporting. For the first word, 'تسدد' is already the best possible stem, because the Snowball Arabic stemmer is based on a light stemming algorithm that only deals with prefixes and suffixes. If you are looking for the root of 'تسدد', you can use ISRI (a root-based, deep stemmer). The second word, 'من', is a stop word; you should apply a stop-word filter before using the Snowball ArabicStemmer. This stemmer also does not handle words that have only 2 letters.
Anyway, I've fixed the problem in PR #1856.
Thank you again!
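
The advice above (filter stop words first, and do not pass two-letter words to the stemmer) could be sketched as a small wrapper. Everything here is illustrative: `safe_stem`, `min_length`, the one-entry stop-word set, and `fake_stem` are hypothetical names, and `fake_stem` is only a placeholder for a real stemmer.

```python
def safe_stem(word, stem, stopwords, min_length=3):
    """Return `word` unchanged if it is a stop word or too short, else stem it."""
    if word in stopwords or len(word) < min_length:
        return word
    return stem(word)


def fake_stem(word):
    # Placeholder for a real stemmer: it simply drops the last character.
    return word[:-1]


# Hypothetical one-entry stop-word set, for illustration only.
stopwords = frozenset([u'من'])

skipped = safe_stem(u'من', fake_stem, stopwords)     # stop word, returned as-is
stemmed = safe_stem(u'تسدد', fake_stem, stopwords)   # passed on to the stemmer
```

With NLTK installed, the `stem` argument would be the bound method used in the report, `SnowballStemmer('arabic').stem`.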

@richbalmer
Author

@LBenzahia thanks for looking into this so quickly! I'm getting:

  File "/Users/richardbalmer/src/nltk/nltk/stem/util.py", line 24
    arabic_stopwords = ['إذ',
                             ^
SyntaxError: Non-ASCII character '\xd8' in file /Users/richardbalmer/src/nltk/nltk/stem/util.py on line 24, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Which also appears to be causing the tests to fail on Jenkins (https://nltk.ci.cloudbees.com/job/pull_request_tests/454/TOXENV=py27-jenkins,jdk=jdk8latestOnlineInstall/testReport/nose.failure/Failure/runTest/). I think all you need to do is put # -*- coding: utf-8 -*- at the top of stem/util.py.

Also, after fixing that locally I get a UnicodeWarning:

/Users/richardbalmer/src/nltk/nltk/stem/snowball.py:748: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if word in arabic_stopwords:

It might be worth making those stopwords unicode strings.
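
A minimal sketch of both suggestions, assuming a hypothetical two-entry excerpt of the stop-word list (the real list in the PR is longer, and `is_stopword` is an illustrative helper, not an NLTK function). The first line is the PEP 263 encoding declaration, and the `u` prefix makes each literal a unicode string on Python 2 while remaining valid on Python 3.3 and later.

```python
# -*- coding: utf-8 -*-
# The line above is the PEP 263 encoding declaration. Python 2 needs it on the
# first or second line of any source file that contains non-ASCII literals.

# Hypothetical two-entry excerpt of a stop-word list, for illustration only.
arabic_stopwords = [u'إذ', u'من']


def is_stopword(word):
    # Both operands are unicode, so Python 2 compares them directly instead of
    # attempting an implicit ASCII decode (the source of the UnicodeWarning).
    return word in arabic_stopwords
```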

Other than that it looks like your fix works nicely for me - thanks again!

p.s. One other suggestion: testing set inclusion is quite a lot faster than list inclusion, so it might be worth making that stopword list a set instead.
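
A rough way to see the difference is to time both membership tests with `timeit`. The sketch below uses 1,000 dummy strings as a stand-in for a stop-word list (an assumption, not the actual NLTK data); absolute timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for a stop-word list: 1,000 distinct dummy strings.
words_list = [u'word%d' % i for i in range(1000)]
words_set = frozenset(words_list)

# A word that is absent is the worst case for the list: every element is compared.
probe = u'not-a-stopword'

list_time = timeit.timeit(lambda: probe in words_list, number=2000)
set_time = timeit.timeit(lambda: probe in words_set, number=2000)

print('list: %.4fs  set: %.4fs' % (list_time, set_time))
```

The list test scans elements one by one, while the set test is a hash lookup, so the gap grows with the number of stop words.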

@greenat92
Contributor

greenat92 commented Oct 13, 2017

@richbalmer are you using Python 2.7?

"It might be worth making those stopwords unicode strings."

Done for Python 2.7. Please test it again and let me know; it works fine for me. I've updated the PR.

@richbalmer
Author

Yup I'm using 2.7. Looking good @LBenzahia - thanks again!

@NouraAls

I'm still getting this error:
AttributeError: 'ArabicStemmer' object has no attribute '_ArabicStemmer__conjugation_suffix_verb_present'

I'm using Python 3.

@greenat92
Contributor

@NouraAls solved in PR

stevenbird added a commit that referenced this issue Oct 21, 2018