“IndexError: string index out of range” with nltk library - to delete #1732

Closed
Diyago opened this issue May 21, 2017 · 5 comments

Comments

@Diyago

Diyago commented May 21, 2017

I'm using the latest available version of the nltk library (3.2.4) with Python 2.7+, but the error still persists. It was first described here (SO) and here #1261

The goal is to apply a stemmer to a dataframe column:

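import pandas as pd
import numpy as np
from sklearn.feature_extraction import text
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

import nltk
porter = PorterStemmer()
train = pd.read_csv("train.csv")

def stem_str(x, stemmer=SnowballStemmer('english')):
    # text.re is the standard re module, as imported inside sklearn's text module.
    # Replace every non-alphanumeric character with a space.
    x = text.re.sub("[^a-zA-Z0-9]", " ", x)
    # Stem each space-separated token (consecutive spaces yield empty tokens).
    x = " ".join([stemmer.stem(z) for z in x.split(" ")])
    # Collapse runs of whitespace into single spaces.
    x = " ".join(x.split())
    return x

train['col2'] = train['col1'].astype(str).apply(lambda x: stem_str(x.lower(), porter))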

As a result I get the following error:

/home/.../anaconda2/lib/python2.7/site-packages/nltk/stem/porter.pyc in _ends_double_consonant(self, word)
    212         """
    213         return (
--> 214             len(word) >= 2 and
    215             word[-1] == word[-2] and
    216             self._is_consonant(word, len(word)-1)

IndexError: string index out of range

Full stack trace:

IndexError                                Traceback (most recent call last)
<ipython-input-25-58ca95c5b364> in <module>()
----> 1 main()

<ipython-input-24-1a1fab0e5ac4> in main()
     15     print('Generate porter')
     16 
---> 17     train['question1_porter'] = train['question1'].astype(str).apply(lambda x:stem_str(x.lower(),porter))
     18     test['question1_porter'] = test['question1'].astype(str).apply(lambda x:stem_str(x.lower(),porter))
     19 

/home/analyst/anaconda2/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2353             else:
   2354                 values = self.asobject
-> 2355                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356 
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()

<ipython-input-24-1a1fab0e5ac4> in <lambda>(x)
     15     print('Generate porter')
     16 
---> 17     train['question1_porter'] = train['question1'].astype(str).apply(lambda x:stem_str(x.lower(),porter))
     18     test['question1_porter'] = test['question1'].astype(str).apply(lambda x:stem_str(x.lower(),porter))
     19 

<ipython-input-18-3b87bf648e19> in stem_str(x, stemmer)
     37 def stem_str(x,stemmer=SnowballStemmer('english')):
     38         x = text.re.sub("[^a-zA-Z0-9]"," ", x)
---> 39         x = (" ").join([stemmer.stem(z) for z in x.split(" ")])
     40         x = " ".join(x.split())
     41         return x

/home/analyst/anaconda2/lib/python2.7/site-packages/nltk/stem/porter.pyc in stem(self, word)
    663             return word
    664 
--> 665         stem = self._step1a(stem)
    666         stem = self._step1b(stem)
    667         stem = self._step1c(stem)

/home/analyst/anaconda2/lib/python2.7/site-packages/nltk/stem/porter.pyc in _step1b(self, word)
    374             (
    375                 '',
--> 376                 'e',
    377                 lambda stem: (self._measure(stem) == 1 and
    378                               self._ends_cvc(stem))

/home/analyst/anaconda2/lib/python2.7/site-packages/nltk/stem/porter.pyc in _apply_rule_list(self, word, rules)
    256         """
    257         for rule in rules:
--> 258             suffix, replacement, condition = rule
    259             if suffix == '*d' and self._ends_double_consonant(word):
    260                 stem = word[:-2]

/home/analyst/anaconda2/lib/python2.7/site-packages/nltk/stem/porter.pyc in _ends_double_consonant(self, word)
    212         """
    213         return (
--> 214             len(word) >= 2 and
    215             word[-1] == word[-2] and
    216             self._is_consonant(word, len(word)-1)

IndexError: string index out of range
@Diyago
Author

Diyago commented May 21, 2017

Link to the SO question about the same problem

@Diyago
Author

Diyago commented May 22, 2017

Please delete the issue, it's just an update problem. My bad

@Diyago Diyago changed the title “IndexError: string index out of range” with nltk library “IndexError: string index out of range” with nltk library - do delete May 22, 2017
@Diyago Diyago changed the title “IndexError: string index out of range” with nltk library - do delete “IndexError: string index out of range” with nltk library - to delete May 22, 2017
@alvations
Contributor

alvations commented May 22, 2017

@Diyago, if it's convenient, before closing this issue, please tell us what the update problem was so that we can document it in case another user has the same issue.

Was it something to do with the pip install -U nltk command not upgrading the correct python site-packages?
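
For anyone wanting to rule out that possibility, here is a minimal sketch of running pip through the same interpreter that the session itself uses, so the upgrade lands in that interpreter's site-packages. The helper names `pip_upgrade_command` and `upgrade` are hypothetical, made up for illustration, and this is not confirmed to be the cause in this thread.

```python
import subprocess
import sys


def pip_upgrade_command(package):
    """Build a pip upgrade command bound to the running interpreter."""
    # sys.executable is the Python binary executing this code, so
    # "-m pip" targets that interpreter's own site-packages.
    return [sys.executable, "-m", "pip", "install", "-U", package]


def upgrade(package):
    """Run the upgrade; raises CalledProcessError if pip exits non-zero."""
    subprocess.check_call(pip_upgrade_command(package))


if __name__ == "__main__":
    # Only print the command here; call upgrade("nltk") to actually run it.
    print(" ".join(pip_upgrade_command("nltk")))
```

An already imported module is not replaced by the upgrade, so the process still needs a restart afterwards.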

@Diyago
Author

Diyago commented May 22, 2017

@alvations Shame on me, really :) I've been using IPython. Usually a newly installed library is visible right after installation, but upgrading behaves differently. I manually restarted the kernel, but I believe the old version persisted. Only restarting the whole notebook fixed the initial problem.
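
A rough way to spot this kind of stale import is to compare the version of the module already loaded in the running process with the version installed on disk. The sketch below assumes Python 3.8+ (for `importlib.metadata`), whereas this thread used Python 2.7. It also assumes that the distribution name matches the import name, as it does for nltk. The helper name `loaded_vs_installed` is hypothetical.

```python
import importlib
import sys
from importlib import metadata


def loaded_vs_installed(name):
    """Return (version loaded in this process, version installed on disk)."""
    # Reuse the module object already in memory if there is one;
    # this is the copy a long-running kernel keeps after an upgrade.
    module = sys.modules.get(name) or importlib.import_module(name)
    loaded = getattr(module, "__version__", None)
    try:
        # Reads the installed distribution's metadata from disk.
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        installed = None
    return loaded, installed


if __name__ == "__main__":
    try:
        loaded, installed = loaded_vs_installed("nltk")
    except ImportError:
        print("nltk is not importable from", sys.executable)
    else:
        print("loaded:", loaded, "| installed:", installed)
        if loaded and installed and loaded != installed:
            print("Stale import: restart the kernel or notebook server.")
```

If the two versions differ, the process is still running the old code, which matches the behavior described above.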

@Diyago Diyago closed this as completed May 22, 2017
@alvations
Contributor

@Diyago Thank you for documenting the problem.
