porter stemmer: string index out of range #1581

ghost · 2017-01-07T05:49:30Z

fievelk · 2017-01-07T10:39:26Z

For future reference, I am copying your question here:

I have a set of pickled text documents which I would like to stem using nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside of a django app view.

However, when stemming the documents inside the django view, I receive an IndexError: string index out of range exception from PorterStemmer().stem() for the string 'oed'. As a result, running the following:

# xkcd_project/search/views.py
from django.shortcuts import render
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')
    return render(request, 'list.html')

raises the mentioned error:

Traceback (most recent call last):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
    response = get_response(request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
    s.stem('oed')
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
    stem = self._step1b(stem)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
    lambda stem: (self._measure(stem) == 1 and
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
    if suffix == '*d' and self._ends_double_consonant(word):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
    word[-1] == word[-2] and
IndexError: string index out of range

What is really odd is that running the same stemmer on the same string outside django (whether from a separate python file or an interactive python console) produces no error. In other words:

# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print(s.stem('oed'))

followed by:

python test.py
# successfully prints 'o'

What is causing this issue?

ghost · 2017-01-07T18:17:41Z

I have found that this issue is specific to nltk version 3.2.2. Originally, I ran test.py using ipython, not python as stated above. Somehow, I was able to access the ipython installation in my root environment //anaconda/bin/ipython even though I had not installed ipython in my django project's (activated) virtual environment //anaconda/envs/xkcd/bin/. As a result, ipython must have been using the nltk installation from my root environment as well, which is version 3.2.0.

To clarify, I have discovered that the PorterStemmer fails to stem the string 'oed' in nltk version 3.2.2 but not in nltk version 3.2.0. I have no idea why.

As a side note, I was using python 2 in both cases. My root environment uses python 2.7.11 and my django project's environment uses python 2.7.13.
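One way to rule out this kind of environment mix-up is to print which interpreter is running and where a package was imported from. The following is only a sketch: describe_module is a hypothetical helper, not part of nltk, and it is shown with a standard-library module so that it runs anywhere.

```python
import importlib
import sys


def describe_module(name):
    """Report which interpreter is running and where `name` was imported from."""
    module = importlib.import_module(name)
    return {
        "interpreter": sys.executable,
        # Built-in modules may have no __file__, hence the default.
        "location": getattr(module, "__file__", None),
        # Not every package defines __version__, hence the default.
        "version": getattr(module, "__version__", None),
    }


# Standard-library example; pass "nltk" instead to inspect the nltk install.
info = describe_module("json")
print(info["interpreter"])
print(info["location"])
```

Running the same call from both ipython and python would show whether the two are picking up nltk from different site-packages directories.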

fievelk · 2017-01-07T19:10:13Z

@ExplodingCabbage could you please investigate this issue? The only commit I can see on porter.py since 3.2 was released is d8402e3.

fievelk · 2017-01-07T19:31:58Z

This is the code used in the example provided by @jkarimi91.

from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print(s.stem('oed'))

Debugging the code above using pdb from within _apply_rule_list() in porter.py, after a few iterations you get:

>>> rule
(u'at', u'ate', None)
>>> word
u'o'

At this point the _ends_double_consonant() method evaluates word[-1] == word[-2], which fails because word is only one character long.

If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:
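The failure can be reproduced without nltk. This is a minimal sketch that copies only the comparison shown in the traceback; unguarded_check is a hypothetical name, not the library's method.

```python
word = "o"  # the value of `word` seen in the pdb session above


def unguarded_check(word):
    # Only the comparison from the traceback, not the full NLTK method.
    return word[-1] == word[-2]


try:
    unguarded_check(word)
    outcome = "no error"
except IndexError as exc:
    # A one-character string has no index -2.
    outcome = str(exc)

print(outcome)
```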

def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if word[-1] != word[-2]:
        return False
    return self._cons(word, len(word)-1)

As far as I can see, the len(word) < 2 check is missing in the new version.

Changing _ends_double_consonant() to something like this should work:

def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )
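To try the guard in isolation, here is a standalone sketch of the same idea. It is not the code from porter.py: is_consonant and ends_double_consonant are module-level stand-ins written for illustration, using the Porter definition of a consonant (any letter other than a, e, i, o, u, and other than a 'y' that follows a consonant).

```python
VOWELS = frozenset("aeiou")


def is_consonant(word, i):
    """Return True if word[i] is a consonant in the Porter sense."""
    if word[i] in VOWELS:
        return False
    if word[i] == "y":
        # 'y' is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_double_consonant(word):
    """Condition *d, with the length guard proposed above."""
    if len(word) < 2:
        return False
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


for candidate in ("o", "", "hopp", "tree", "fall"):
    print(repr(candidate), ends_double_consonant(candidate))
```

With the guard in place, one-character and empty inputs return False instead of raising.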

ExplodingCabbage · 2017-01-07T19:43:07Z

Yikes. Yep, looks like I broke this in d8402e3 :(

Will PR a test and a fix tonight.

stevenbird · 2017-01-07T22:42:21Z

Thanks @jkarimi91, @fievelk, @ExplodingCabbage

santoshbs · 2017-02-10T05:37:00Z

Hi, I encountered the exact same issue today. Could you please suggest how I could get a fix to this? Should I update any packages?

ExplodingCabbage · 2017-02-10T10:16:52Z

Hi @santoshbs. You can either use the master version of NLTK or release 3.2.1 to get rid of the bug; it only exists in version 3.2.2.

fievelk · 2017-02-10T10:19:34Z

@ExplodingCabbage I think you are referring to the develop branch (not master). It's easy to get confused I guess :)

ExplodingCabbage · 2017-02-10T10:25:12Z

@fievelk you are quite right. Sorry, yes: you can either use the develop branch or 3.2.1 to get rid of the bug.
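For anyone unsure whether their install is the affected release, a small check along these lines may help. It is a sketch based only on what this thread reports (3.2.2 is the one release with the bug); is_affected is a hypothetical helper and it assumes a plain numeric version string.

```python
def is_affected(version):
    """Return True if `version` is the release reported as buggy in this thread."""
    # Assumes dotted numeric components such as "3.2.2".
    parts = tuple(int(piece) for piece in version.split(".")[:3])
    return parts == (3, 2, 2)


# With nltk installed, the string to pass in would be nltk.__version__.
print(is_affected("3.2.2"))
print(is_affected("3.2.1"))
```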

santoshbs · 2017-02-10T15:45:49Z

Thanks so much for the pointer.

fievelk added the pleaseverify label Jan 7, 2017

ExplodingCabbage added a commit to ExplodingCabbage/nltk that referenced this issue Jan 7, 2017

Add test for nltk#1581

daa4cdb

ExplodingCabbage added a commit to ExplodingCabbage/nltk that referenced this issue Jan 7, 2017

Fix nltk#1581

503f8c8

ExplodingCabbage mentioned this issue Jan 7, 2017

Fix Porter stemmer failing on 'oed' #1582

Merged

stevenbird closed this as completed in #1582 Jan 7, 2017

ExplodingCabbage mentioned this issue Feb 9, 2017

"IndexError: string index out of range" on trying to stem the word "oing" #1614

Closed

JackBurdick added a commit to JackBurdick/nlp_sentiment_rnn that referenced this issue Jun 26, 2017

begin preprocessing - may be an error with nltk

7506c1b

PabloDino mentioned this issue Sep 9, 2019

Update various regex escape sequences #2378

Closed
