
KeyError: "[E018] Can't retrieve string for hash '10542206011124529393'." #258

Open
radkoff opened this issue Jul 18, 2019 · 1 comment
radkoff commented Jul 18, 2019

steps to reproduce

First create the following Corpus, save it to disk, and note that upon reloading you can still get word doc counts:

import textacy
corpus = textacy.Corpus('en', ['Pittsburgh', 'slated for. Stacey designated as moderator'])
corpus.save('foo.textacy')
corpus = textacy.Corpus.load('en', 'foo.textacy')
print(corpus.word_doc_counts())

But then open a new Python shell, load the same corpus from disk, and get an error about a word ID missing from the vocab:

import textacy
corpus = textacy.Corpus.load('en', 'foo.textacy')
print(corpus.word_doc_counts())
Traceback (most recent call last):
  File "/Users/radkoff/anaconda3/envs/st-py37/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-dff4867a4989>", line 3, in <module>
    print(corpus.word_doc_counts())
  File "/Users/radkoff/anaconda3/envs/st-py37/lib/python3.7/site-packages/textacy/corpus.py", line 494, in word_doc_counts
    normalize=normalize, weighting="binary", as_strings=as_strings
  File "/Users/radkoff/anaconda3/envs/st-py37/lib/python3.7/site-packages/textacy/spacier/doc_extensions.py", line 511, in to_bag_of_words
    lex = vocab[wid]
  File "vocab.pyx", line 237, in spacy.vocab.Vocab.__getitem__
  File "lexeme.pyx", line 44, in spacy.lexeme.Lexeme.__init__
  File "vocab.pyx", line 152, in spacy.vocab.Vocab.get_by_orth
  File "strings.pyx", line 138, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '10542206011124529393'."
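To illustrate the failure mode, here is a minimal pure-Python sketch (an illustrative stand-in, not spaCy's actual implementation): spaCy's StringStore maps 64-bit hashes back to strings, and docs store only the hashes, so if the string table isn't available in a fresh process, the reverse lookup fails with exactly this kind of KeyError.

```python
import hashlib

def fake_hash(s):
    # stand-in for spaCy's string hashing; any stable 64-bit hash works here
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class TinyStringStore:
    """Toy model of a hash -> string table like spaCy's StringStore."""
    def __init__(self):
        self._map = {}

    def add(self, s):
        h = fake_hash(s)
        self._map[h] = s
        return h

    def __getitem__(self, h):
        try:
            return self._map[h]
        except KeyError:
            raise KeyError(f"[E018] Can't retrieve string for hash '{h}'.") from None

store = TinyStringStore()
wid = store.add("Pittsburgh")
assert store[wid] == "Pittsburgh"  # works in the original process

# Simulate loading the corpus in a fresh shell without the string
# table: the hash survives serialization, the string does not.
fresh_store = TinyStringStore()
try:
    fresh_store[wid]
except KeyError as e:
    print(e)
```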

context

The particular example above was narrowed down from larger texts, and strangely, at this point removing any further word makes the bug go away. E.g., the following all work:
['Pittsburgh', 'slated for. Stacey designated moderator']
['Pittsburgh', 'slated. Stacey designated as moderator']
['Pittsburgh', 'for. Stacey designated as moderator']
['slated for. Stacey designated as moderator']
['this is doc one', 'this is doc two']

I've run into this with several different corpora (I'm trying to build IDF models).

possible solution?

I'm guessing it has something to do with accessing the lemmas of words. Maybe the Vocab needs to be serialized along with the docs themselves? See explosion/spaCy#2419
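A hedged sketch of that idea: persist the hash-to-string table together with the per-doc hashes so a fresh process can resolve every stored hash. The functions and file layout below are hypothetical illustrations, not textacy's or spaCy's actual serialization format.

```python
import json
import os
import tempfile

def save_corpus(path, word_ids, string_map):
    # persist both the per-doc word-id hashes and the hash -> string table
    with open(path, "w") as f:
        json.dump({"word_ids": word_ids,
                   "strings": {str(h): s for h, s in string_map.items()}}, f)

def load_corpus(path):
    with open(path) as f:
        data = json.load(f)
    # restore the string table so every stored hash resolves,
    # even in a brand-new process
    strings = {int(h): s for h, s in data["strings"].items()}
    return data["word_ids"], strings

path = os.path.join(tempfile.gettempdir(), "corpus_sketch.json")
ids = [10542206011124529393]
save_corpus(path, ids, {10542206011124529393: "Pittsburgh"})
loaded_ids, loaded_strings = load_corpus(path)
print(loaded_strings[loaded_ids[0]])  # prints "Pittsburgh" instead of raising E018
```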

environment

  • platform: darwin
  • python: 3.7.3 (default, Mar 27 2019, 16:54:48) [Clang 4.0.1 (tags/RELEASE_401/final)]
  • spacy: 2.1.3
  • spacy_models: ['en']
  • textacy: 0.7.1
@radkoff radkoff added the bug label Jul 18, 2019
radkoff commented Jul 18, 2019

After upgrading textacy and spacy, the error now seems to be intermittent (or maybe it was before?), so you may have to try loading it in a new shell a few times before it fails.

  • platform: darwin
  • python: 3.7.3 (default, Mar 27 2019, 16:54:48) [Clang 4.0.1 (tags/RELEASE_401/final)]
  • spacy: 2.1.3
  • spacy_models: ['en']
  • textacy: 0.8.0
