
KeyError: "[E018] Can't retrieve string for hash '10542206011124529393'." #258

Open
radkoff opened this issue Jul 18, 2019 · 1 comment
radkoff commented Jul 18, 2019

steps to reproduce

First create the following Corpus, save it to disk, and note that upon reloading you can still get word doc counts:

import textacy
corpus = textacy.Corpus('en', ['Pittsburgh', 'slated for. Stacey designated as moderator'])
corpus.save('foo.textacy')
corpus = textacy.Corpus.load('en', 'foo.textacy')
print(corpus.word_doc_counts())

But then open a new Python shell, load the same corpus from disk, and get an error about a word ID missing from the vocab:

import textacy
corpus = textacy.Corpus.load('en', 'foo.textacy')
print(corpus.word_doc_counts())
Traceback (most recent call last):
  File "/Users/radkoff/anaconda3/envs/st-py37/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-dff4867a4989>", line 3, in <module>
    print(corpus.word_doc_counts())
  File "/Users/radkoff/anaconda3/envs/st-py37/lib/python3.7/site-packages/textacy/corpus.py", line 494, in word_doc_counts
    normalize=normalize, weighting="binary", as_strings=as_strings
  File "/Users/radkoff/anaconda3/envs/st-py37/lib/python3.7/site-packages/textacy/spacier/doc_extensions.py", line 511, in to_bag_of_words
    lex = vocab[wid]
  File "vocab.pyx", line 237, in spacy.vocab.Vocab.__getitem__
  File "lexeme.pyx", line 44, in spacy.lexeme.Lexeme.__init__
  File "vocab.pyx", line 152, in spacy.vocab.Vocab.get_by_orth
  File "strings.pyx", line 138, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '10542206011124529393'."
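To illustrate the failure mode, here is a minimal pure-Python sketch (an illustrative stand-in, not spaCy's actual implementation): spaCy's StringStore maps 64-bit hashes back to strings, and docs store only the hashes, so if the string table isn't available in a fresh process, the reverse lookup fails with exactly this kind of KeyError.

```python
import hashlib

def fake_hash(s):
    # stand-in for spaCy's string hashing; any stable 64-bit hash works here
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class TinyStringStore:
    """Toy model of a hash -> string table like spaCy's StringStore."""
    def __init__(self):
        self._map = {}

    def add(self, s):
        h = fake_hash(s)
        self._map[h] = s
        return h

    def __getitem__(self, h):
        try:
            return self._map[h]
        except KeyError:
            raise KeyError(f"[E018] Can't retrieve string for hash '{h}'.") from None

store = TinyStringStore()
wid = store.add("Pittsburgh")
assert store[wid] == "Pittsburgh"  # works in the original process

# Simulate loading the corpus in a fresh shell without the string
# table: the hash survives serialization, the string does not.
fresh_store = TinyStringStore()
try:
    fresh_store[wid]
except KeyError as e:
    print(e)
```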

context

The particular example above was narrowed down from larger texts, and strangely, at this point removing any further word makes the bug go away. E.g., the following all work:
['Pittsburgh', 'slated for. Stacey designated moderator']
['Pittsburgh', 'slated. Stacey designated as moderator']
['Pittsburgh', 'for. Stacey designated as moderator']
['slated for. Stacey designated as moderator']
['this is doc one', 'this is doc two']

I've run into this with several different corpora (I'm trying to build IDF models).

possible solution?

I'm guessing it has something to do with accessing the lemmas of words. Maybe the Vocab needs to be serialized along with the docs themselves? See explosion/spaCy#2419
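A hedged sketch of that idea: persist the hash-to-string table together with the per-doc hashes so a fresh process can resolve every stored hash. The functions and file layout below are hypothetical illustrations, not textacy's or spaCy's actual serialization format.

```python
import json
import os
import tempfile

def save_corpus(path, word_ids, string_map):
    # persist both the per-doc word-id hashes and the hash -> string table
    with open(path, "w") as f:
        json.dump({"word_ids": word_ids,
                   "strings": {str(h): s for h, s in string_map.items()}}, f)

def load_corpus(path):
    with open(path) as f:
        data = json.load(f)
    # restore the string table so every stored hash resolves,
    # even in a brand-new process
    strings = {int(h): s for h, s in data["strings"].items()}
    return data["word_ids"], strings

path = os.path.join(tempfile.gettempdir(), "corpus_sketch.json")
ids = [10542206011124529393]
save_corpus(path, ids, {10542206011124529393: "Pittsburgh"})
loaded_ids, loaded_strings = load_corpus(path)
print(loaded_strings[loaded_ids[0]])  # prints "Pittsburgh" instead of raising E018
```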

environment

  • platform: darwin
  • python: 3.7.3 (default, Mar 27 2019, 16:54:48) [Clang 4.0.1 (tags/RELEASE_401/final)]
  • spacy: 2.1.3
  • spacy_models: ['en']
  • textacy: 0.7.1
@radkoff radkoff added the bug label Jul 18, 2019
radkoff commented Jul 18, 2019

After upgrading textacy and spacy, the error now seems to be intermittent (or maybe it was before?), so you may have to try loading it in a new shell a few times before it fails.

  • platform: darwin
  • python: 3.7.3 (default, Mar 27 2019, 16:54:48) [Clang 4.0.1 (tags/RELEASE_401/final)]
  • spacy: 2.1.3
  • spacy_models: ['en']
  • textacy: 0.8.0
