fixed crash on loading http://nlp.stanford.edu/data/wordvecs/glove.84… #61

Open
wants to merge 1 commit into master

Conversation

ok-ok-ok-ok

fixed crash on loading http://nlp.stanford.edu/data/wordvecs/glove.840B.300d.zip pretrained model

@maciejkula (Owner)

May I suggest you post the message you are getting, as well as add a test for this?

@maciejkula closed this Apr 25, 2017
@maciejkula reopened this Apr 25, 2017
@maciejkula (Owner)

Closed and re-opened to trigger Circle build.

@richbalmer

I suspect the issue here is some Python unicode badness: the pretrained GloVe vectors contain two lines that Python thinks hold vectors for the same word:

head -138701 GloVe/glove.840B.300d.txt | cut -d' ' -f1 | tail -1
������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������ ���������������������������������������������������������������������������������������������������������������������������������
head -140649 GloVe/glove.840B.300d.txt | cut -d' ' -f1 | tail -1
����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
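
One way two different byte strings can end up looking like the same word to Python (just a guess at the "unicode badness" above, not something verified against these particular lines) is lossy decoding: byte sequences that are invalid UTF-8 can decode to identical runs of the U+FFFD replacement character and then compare equal as ordinary strings:

raw_1 = b'\xff\xfe'   # two different byte strings, both invalid UTF-8
raw_2 = b'\xfa\xfb'
decoded_1 = raw_1.decode('utf-8', errors='replace')
decoded_2 = raw_2.decode('utf-8', errors='replace')
print(decoded_1 == decoded_2)  # True: both become '\ufffd\ufffd', so they collide as dict keys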

So as we read the second line we successfully append a new entry to the "vectors" array, but when we add it to the "dct" dict the second entry overwrites the first. This means we end up trying to reshape into a matrix of the wrong size (note that 2196016 * 300 is 300 less than 658805100: the missing entry in "dct" is causing the error):

... in load_stanford(cls, filename)
    265         instance.word_vectors = (np.array(vectors)
    266                                  .reshape(no_vectors,
--> 267                                           no_components))
    268         instance.word_biases = np.zeros(no_vectors)
    269         instance.add_dictionary(dct)

ValueError: cannot reshape array of size 658805100 into shape (2196016,300)
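
For illustration, here is a stripped-down sketch of a loading loop in the spirit of load_stanford (not the actual glove-python code; the file handling and everything except the names visible in the traceback are assumptions) showing how a single repeated token leaves "dct" one entry short of "vectors" and produces exactly this error:

import numpy as np

def load_stanford_sketch(filename):
    dct = {}
    vectors = []
    with open(filename, 'r', encoding='utf-8', errors='replace') as f:
        for i, line in enumerate(f):
            tokens = line.rstrip().split(' ')
            word, values = tokens[0], tokens[1:]
            dct[word] = i                             # a repeated word overwrites its earlier index...
            vectors.extend(float(x) for x in values)  # ...but its values are still appended
    no_components = len(values)   # number of floats on the last line (300 here)
    no_vectors = len(dct)         # 2196016 for glove.840B.300d.txt: one less than the lines read
    # len(vectors) is 658805100 == (no_vectors + 1) * 300, so this raises the ValueError above
    return np.array(vectors).reshape(no_vectors, no_components), dct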

The change in this PR solves this for me. It is just throwing away the second vector though, so maybe a solution that distinguishes between those two different words would be better? That said, I don't think that making a distinction between those two bits of garbled nonsense is particularly important, or even meaningful.
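
For comparison, here is one minimal way to get the "throw away the second vector" behaviour (a sketch of the idea only, not the actual diff in this PR): skip any line whose token has already been seen, so "dct" and "vectors" always stay in step:

import numpy as np

def load_stanford_skip_duplicates(filename):
    dct = {}
    vectors = []
    with open(filename, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            tokens = line.rstrip().split(' ')
            word, values = tokens[0], tokens[1:]
            if word in dct:
                continue                              # discard the later, duplicate entry
            dct[word] = len(dct)                      # index matches the row appended below
            vectors.extend(float(x) for x in values)
    no_components = len(values)
    no_vectors = len(dct)                             # now equal to len(vectors) // no_components
    return np.array(vectors).reshape(no_vectors, no_components), dct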

@chan0park mentioned this pull request Mar 15, 2018