porter stemmer: string index out of range #1581

Closed
ghost opened this issue Jan 7, 2017 · 11 comments

Comments

@ghost

ghost commented Jan 7, 2017

See the following Stack Overflow post.

@fievelk
Member

fievelk commented Jan 7, 2017

For future reference, I copy/paste your question here:


I have a set of pickled text documents which I would like to stem using NLTK's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside a Django app view.

However, when stemming the documents inside the Django view, PorterStemmer().stem() raises IndexError: string index out of range for the string 'oed'. As a result, running the following:

# xkcd_project/search/views.py
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')
    return render(request, 'list.html')

raises the mentioned error:

Traceback (most recent call last):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
    response = get_response(request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
    s.stem('oed')
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
    stem = self._step1b(stem)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
    lambda stem: (self._measure(stem) == 1 and
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
    if suffix == '*d' and self._ends_double_consonant(word):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
    word[-1] == word[-2] and
IndexError: string index out of range

What is really odd is that running the same stemmer on the same string outside Django (whether from a separate Python file or an interactive Python console) produces no error. In other words:

# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')

followed by:

python test.py
# successfully prints 'o'

What is causing this issue?

@ghost
Author

ghost commented Jan 7, 2017

I have found that this issue is specific to nltk version 3.2.2. Originally, I ran test.py using ipython, not python as stated above. Somehow, I was able to access the ipython installation in my root environment (//anaconda/bin/ipython) even though I had not installed ipython in my Django project's activated virtual environment (//anaconda/envs/xkcd/bin/). As a result, ipython must also have been using the nltk installation from my root environment, which is version 3.2.0.

To clarify, I have discovered that the PorterStemmer fails to stem the string 'oed' in nltk version 3.2.2 but not in nltk version 3.2.0. I have no idea why.

As a side note, I was using Python 2 in both cases. My root environment uses Python 2.7.11 and my Django project's environment uses Python 2.7.13.
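For anyone who runs into the same mix-up between environments, a quick way to confirm which interpreter and which nltk installation a given process actually uses is to print them from inside that process. The following is only a sketch: the helper name `describe_environment` is made up for illustration, and it relies only on `sys.executable`, `sys.version`, and the `__version__` and `__file__` attributes of the imported `nltk` module.

```python
import sys


def describe_environment():
    """Report which interpreter and which nltk installation are in use.

    `describe_environment` is a hypothetical helper, not part of nltk.
    """
    info = {
        "python": sys.version.split()[0],
        "executable": sys.executable,
    }
    try:
        import nltk
    except ImportError:
        # nltk is not installed in this environment at all.
        info["nltk_version"] = None
        info["nltk_path"] = None
    else:
        info["nltk_version"] = getattr(nltk, "__version__", None)
        info["nltk_path"] = getattr(nltk, "__file__", None)
    return info


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print("{0}: {1}".format(key, value))
```

Running this once from the Django view's environment and once from the console should show whether the two are picking up different nltk versions.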

@fievelk
Member

fievelk commented Jan 7, 2017

@ExplodingCabbage could you please investigate this issue? The only commit I can see on porter.py since 3.2 was released is d8402e3.

@fievelk
Member

fievelk commented Jan 7, 2017

This is the code used in the example provided by @jkarimi91.

from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')

Debugging the code above using pdb from within _apply_rule_list() in porter.py, after a few iterations you get:

>>> rule
(u'at', u'ate', None)
>>> word
u'o'

At this point the _ends_double_consonant() method tries to do word[-1] == word[-2] and it fails.

If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:

def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):
        return False
    return self._cons(word, len(word)-1)

As far as I can see, the len(word) < 2 check is missing in the new version.

Changing _ends_double_consonant() to something like this should work:

def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )
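To make the failure mode concrete without needing a particular NLTK version installed, here is a minimal standalone sketch of the check with and without the length guard. The function names and the simplified `is_consonant` helper are assumptions written for this illustration; they are not NLTK's actual code.

```python
def is_consonant(word, i):
    """Simplified stand-in for the stemmer's consonant test (hypothetical).

    A letter is a consonant unless it is a, e, i, o, u, or a 'y'
    that is preceded by a consonant.
    """
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_double_consonant_unguarded(word):
    """Mirrors the broken check: indexes word[-2] without a length test."""
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


def ends_double_consonant(word):
    """Guarded version: words shorter than two letters cannot match."""
    if len(word) < 2:
        return False
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


if __name__ == "__main__":
    try:
        ends_double_consonant_unguarded("o")
    except IndexError as exc:
        print("unguarded check failed: {0}".format(exc))
    print(ends_double_consonant("o"))
    print(ends_double_consonant("hopp"))
```

The one-letter stem 'o' is exactly what is left of 'oed' by the time the rule list is applied, which is why the unguarded version raises while the guarded one simply reports no double consonant.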

@ExplodingCabbage
Contributor

Yikes. Yep, looks like I broke this in d8402e3 :(

Will PR a test and a fix tonight.

ExplodingCabbage added a commit to ExplodingCabbage/nltk that referenced this issue Jan 7, 2017
@stevenbird
Member

Thanks @jkarimi91, @fievelk, @ExplodingCabbage

@santoshbs

Hi, I encountered the exact same issue today. Could you please suggest how I could get a fix for this? Should I update any packages?

@ExplodingCabbage
Contributor

Hi @santoshbs. You can either use the master version of NLTK or release 3.2.1 to get rid of the bug; it only exists in version 3.2.2.

@fievelk
Member

fievelk commented Feb 10, 2017

@ExplodingCabbage I think you are referring to the develop branch (not master). It's easy to get confused I guess :)

@ExplodingCabbage
Contributor

@fievelk you are quite right. Sorry, yes: you can either use the develop branch or 3.2.1 to get rid of the bug.
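If changing the installed version is not immediately possible, one stopgap is to catch the exception around the call. This is only a sketch of a workaround, not something proposed in this thread: `safe_stem` is a hypothetical helper, and it assumes only that the stemmer object has a `stem(word)` method, as `PorterStemmer` does. Note that the fallback returns the word unstemmed, so results for affected words will differ from those of a fixed NLTK.

```python
def safe_stem(stemmer, word):
    """Stem `word`, falling back to the unstemmed word on IndexError.

    Hypothetical workaround helper; `stemmer` is any object with a
    stem(word) method, for example nltk's PorterStemmer.
    """
    try:
        return stemmer.stem(word)
    except IndexError:
        # Affected versions raise here for inputs such as 'oed'.
        return word
```

With NLTK installed, this would be called as `safe_stem(PorterStemmer(), 'oed')`.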

@santoshbs

Thanks so much for the pointer.

JackBurdick added a commit to JackBurdick/nlp_sentiment_rnn that referenced this issue Jun 26, 2017