large dimension of the vector representation #161

Closed
stochasticer opened this issue Feb 1, 2014 · 5 comments

Comments

@stochasticer

Hi,
thanks for your help.
I tried to save a trained model whose feature vectors have dimension 2000. The model trains fine, but I am unable to save it. (I am using a Linux terminal on Windows.)
Here is the error:

In [9]: model.save('model_wiki_2000')

SystemError Traceback (most recent call last)
in ()
----> 1 model.save('model_wiki_2000')

/home/usr/.local/lib/python2.7/site-packages/gensim-0.8.9-py2.7.egg/gensim/utils.pyc in save(self, fname)
178 """
179 logger.info("saving %s object to %s" % (self.__class__.__name__, fname))
--> 180 pickle(self, fname)
181 #endclass SaveLoad
182

/home/usr/.local/lib/python2.7/site-packages/gensim-0.8.9-py2.7.egg/gensim/utils.pyc in pickle(obj, fname, protocol)
528 """Pickle object obj to file fname."""
529 with smart_open(fname, 'wb') as fout: # 'b' for binary, needed on Windows
--> 530 cPickle.dump(obj, fout, protocol=protocol)
531
532

SystemError: error return without exception set

@piskvorky
Owner

That's a bug in Python's pickle module: numpy/numpy#2396. Not much I can do about it.

A "fix" is to override the save/load methods so that they serialize the internal NumPy arrays in model.syn0 and model.syn1 separately, into different files (and don't store syn0norm at all).

This is what I did in the LsiModel for example: https://github.com/piskvorky/gensim/blob/develop/gensim/models/lsimodel.py#L534

Let me know if you want to write such a patch for the Word2Vec class too; it's not difficult.
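
For illustration, here is a minimal sketch of the idea described above: pickle only the small attributes, and write each large array to its own .npy file. It is not the LsiModel code linked above and not gensim's API. `ToyModel`, `save_split` and `load_split` are hypothetical names made up for this example; only the attribute names syn0, syn1 and syn0norm come from this thread.

```python
import os
import pickle
import tempfile

import numpy as np


class ToyModel(object):
    """Hypothetical stand-in for a model that holds large NumPy arrays."""

    def __init__(self, vocab_size, dim):
        self.vocab = ["w%d" % i for i in range(vocab_size)]
        self.syn0 = np.random.rand(vocab_size, dim).astype(np.float32)
        self.syn1 = np.zeros((vocab_size, dim), dtype=np.float32)
        self.syn0norm = None  # derived cache, not worth storing


def save_split(model, fname, big=("syn0", "syn1"), skip=("syn0norm",)):
    """Pickle the small attributes; write each big array to its own .npy file."""
    state = dict(model.__dict__)  # shallow copy, the model itself is untouched
    for name in big:
        np.save(fname + "." + name + ".npy", state.pop(name))
    for name in skip:
        state.pop(name, None)
    with open(fname, "wb") as fout:  # 'b' for binary, needed on Windows
        pickle.dump(state, fout, protocol=pickle.HIGHEST_PROTOCOL)


def load_split(fname, big=("syn0", "syn1"), skip=("syn0norm",)):
    """Rebuild the model from the pickle plus the separate .npy files."""
    model = ToyModel.__new__(ToyModel)  # bypass __init__, state comes from disk
    with open(fname, "rb") as fin:
        model.__dict__.update(pickle.load(fin))
    for name in big:
        setattr(model, name, np.load(fname + "." + name + ".npy"))
    for name in skip:
        setattr(model, name, None)  # to be recomputed on demand
    return model


# Round trip in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_model")
original = ToyModel(vocab_size=5, dim=4)
save_split(original, path)
restored = load_split(path)
```

Because the big arrays never pass through pickle, their size no longer matters to it; only the vocabulary and other small attributes do.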

@stochasticer
Author

Thanks! Yes, I will try overriding those methods.
By the way, is this an alternative as well:
train the model with the C package to generate the .bin file, then use Word2Vec.load_word2vec_format() to get the trained model? (Hopefully this won't take too long, since it only loads the model rather than training it.)

@piskvorky
Owner

OK, great! Let me know when the patch is ready for review.

Loading from C word2vec will work if you only want to use the model (and not continue training etc.). The C word2vec formats don't store all the necessary information.
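
To make the limitation concrete, here is a rough sketch of a reader for the plain-text variant of the C word2vec output (a header line with the vocabulary size and vector size, then one word and its vector per line). `read_word2vec_text` is a hypothetical helper written for this example, not gensim's loader, and the sample data is made up. The point is that the file holds only the words and their vectors.

```python
import io


def read_word2vec_text(stream):
    """Parse the plain-text word2vec format into {word: [floats]}."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        # The C tool leaves a trailing space on each line, hence rstrip().
        parts = stream.readline().rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError("expected %d values for %r, got %d"
                             % (dim, word, len(values)))
        vectors[word] = values
    return vectors


# Made-up sample: 2 words, 3 dimensions each.
sample = io.StringIO(u"2 3\nking 0.1 0.2 0.3 \nqueen 0.4 0.5 0.6 \n")
vectors = read_word2vec_text(sample)
```

Nothing in this format corresponds to syn1 or to the training state, which is why a model loaded this way can be queried but not trained further.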

@stochasticer
Author

Thanks a lot! I will post back if I have any success. (Since I am currently testing my similarity measures, C training plus gensim loading may also get me some results quickly. I will try it.)

@piskvorky
Owner

@stochasticer I just pushed a series of commits that allow you to save large word2vec models directly from gensim.

You can now store with model.save('/some/file', ignore=['syn0norm', 'syn1']).

Let me know if that solved your problem.
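
A note on why leaving out syn0norm should be harmless: assuming it holds the unit-length (L2-normalized) rows of syn0, it can be recomputed from syn0 after loading. Below is a small sketch of that recomputation; `unit_rows` is a hypothetical helper for this example, not a gensim function. Leaving out syn1 is different: it cannot be rebuilt this way, so it only makes sense if you don't plan to continue training.

```python
import numpy as np


def unit_rows(syn0):
    """Return a copy of syn0 with every row scaled to unit L2 length.

    Assumes no row is all zeros (that would divide by zero).
    """
    norms = np.sqrt((syn0 ** 2).sum(axis=1, keepdims=True))
    return (syn0 / norms).astype(np.float32)


# Made-up 2-word, 2-dimensional example.
syn0 = np.array([[3.0, 4.0], [1.0, 0.0]], dtype=np.float32)
syn0norm = unit_rows(syn0)
```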

2 participants