fixed crash on loading http://nlp.stanford.edu/data/wordvecs/glove.84… #61

Open
wants to merge 1 commit into master

Conversation

ok-ok-ok-ok

fixed crash on loading http://nlp.stanford.edu/data/wordvecs/glove.840B.300d.zip pretrained model

@maciejkula (Owner)

May I suggest you post the message you are getting, as well as add a test for this?

@maciejkula closed this Apr 25, 2017
@maciejkula reopened this Apr 25, 2017
@maciejkula (Owner)

Closed and re-opened to trigger Circle build.

@richbalmer

I suspect the issue here is some Python unicode badness: the pretrained GloVe vectors contain two lines that Python thinks hold vectors for the same word:

head -138701 GloVe/glove.840B.300d.txt | cut -d' ' -f1 | tail -1
������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������ ���������������������������������������������������������������������������������������������������������������������������������
head -140649 GloVe/glove.840B.300d.txt | cut -d' ' -f1 | tail -1
����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
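
One way two different byte strings can end up looking like the same word to Python (just a guess at the "unicode badness" above, not something verified against these particular lines) is lossy decoding: byte sequences that are invalid UTF-8 can decode to identical runs of the U+FFFD replacement character and then compare equal as ordinary strings:

raw_1 = b'\xff\xfe'   # two different byte strings, both invalid UTF-8
raw_2 = b'\xfa\xfb'
decoded_1 = raw_1.decode('utf-8', errors='replace')
decoded_2 = raw_2.decode('utf-8', errors='replace')
print(decoded_1 == decoded_2)  # True: both become '\ufffd\ufffd', so they collide as dict keys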

So as we read the second line we successfully append a new entry to the "vectors" array, but when we add it to the "dct" dict the second entry overwrites the first. This means we end up trying to reshape into a matrix of the wrong size (note that 2196016 * 300 is 300 less than 658805100: the missing entry in "dct" is causing the error):

... in load_stanford(cls, filename)
    265         instance.word_vectors = (np.array(vectors)
    266                                  .reshape(no_vectors,
--> 267                                           no_components))
    268         instance.word_biases = np.zeros(no_vectors)
    269         instance.add_dictionary(dct)

ValueError: cannot reshape array of size 658805100 into shape (2196016,300)
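
For illustration, here is a stripped-down sketch of a loading loop in the spirit of load_stanford (not the actual glove-python code; the file handling and everything except the names visible in the traceback are assumptions) showing how a single repeated token leaves "dct" one entry short of "vectors" and produces exactly this error:

import numpy as np

def load_stanford_sketch(filename):
    dct = {}
    vectors = []
    with open(filename, 'r', encoding='utf-8', errors='replace') as f:
        for i, line in enumerate(f):
            tokens = line.rstrip().split(' ')
            word, values = tokens[0], tokens[1:]
            dct[word] = i                             # a repeated word overwrites its earlier index...
            vectors.extend(float(x) for x in values)  # ...but its values are still appended
    no_components = len(values)   # number of floats on the last line (300 here)
    no_vectors = len(dct)         # 2196016 for glove.840B.300d.txt: one less than the lines read
    # len(vectors) is 658805100 == (no_vectors + 1) * 300, so this raises the ValueError above
    return np.array(vectors).reshape(no_vectors, no_components), dct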

The change in this PR solves this for me. It is just throwing away the second vector though, so maybe a solution that distinguishes between those two different words would be better? That said, I don't think that making a distinction between those two bits of garbled nonsense is particularly important, or even meaningful.
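
For comparison, here is one minimal way to get the "throw away the second vector" behaviour (a sketch of the idea only, not the actual diff in this PR): skip any line whose token has already been seen, so "dct" and "vectors" always stay in step:

import numpy as np

def load_stanford_skip_duplicates(filename):
    dct = {}
    vectors = []
    with open(filename, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            tokens = line.rstrip().split(' ')
            word, values = tokens[0], tokens[1:]
            if word in dct:
                continue                              # discard the later, duplicate entry
            dct[word] = len(dct)                      # index matches the row appended below
            vectors.extend(float(x) for x in values)
    no_components = len(values)
    no_vectors = len(dct)                             # now equal to len(vectors) // no_components
    return np.array(vectors).reshape(no_vectors, no_components), dct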

@chan0park mentioned this pull request Mar 15, 2018