
Memory problem in building wiki2vec model via gensim #7

Open
nooralahzadeh opened this issue Aug 11, 2015 · 13 comments
@nooralahzadeh

Hi,
Did you have memory problems loading the trained wiki2vec model in gensim?
I trained with size=500, window=10, min_count=10 on the latest English Wikipedia dump, which produced a 13 GB wiki2vec model. When I try to load it in gensim I get a MemoryError.
Do you have any idea how much memory I need?

@dav009
Contributor

dav009 commented Aug 11, 2015

Yeah, this is due to the vocabulary size.
I think there has been some work on this in gensim's word2vec implementation since I last looked.

If you are only interested in getting the entity vectors, then @phdowling has a gensim branch for that, which applies the min_count filter to anything that is not an entity vector.

Otherwise, reduce your vocabulary by either:

  • passing a higher min_count (see the sketch after this list), or
  • cleaning up the noise: the lib that cleans the wikitext generates some garbage tokens.
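Something like this for the first option, just a rough sketch (corpus path and numbers are placeholders, not the wiki2vec pipeline defaults, and it targets the gensim API of this era where the dimensionality parameter is still called size):

# Rough sketch: retrain with a higher min_count to shrink the vocabulary.
# "enwiki_processed.txt" is a placeholder for a pre-tokenized corpus,
# one sentence per line; all numbers here are illustrative.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("enwiki_processed.txt")
model = Word2Vec(
    sentences,
    size=500,      # vector dimensionality ("vector_size" in gensim 4+)
    window=10,
    min_count=50,  # raise this to drop rare, often garbage, tokens
    workers=4,
)
model.save("wiki2vec_smaller.model")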

@nooralahzadeh
Author

Exactly, I want to have just the entity vectors. What do I have to do?
Thanks

@dav009
Contributor

dav009 commented Aug 11, 2015

So I think the best you can do at the moment is to use this gensim fork (the develop branch): https://github.com/piskvorky/gensim/ . That fork contains some changes which will help you deal with the vocab size.

One thing: depending on your current setup (Linux or OS X), you might want to pay attention to how gensim is compiled with Cython, so that when gensim runs it makes use of all your cores.

Give it a go and let us know if it goes alright.
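A quick way to check the Cython part (not from this repo, just a standard gensim idiom): FAST_VERSION is -1 when the slow pure-numpy code path is in use.

from gensim.models import word2vec

# FAST_VERSION is -1 if the Cython extensions did not compile; in that case
# training falls back to the slow numpy path and will not use all cores well.
if word2vec.FAST_VERSION == -1:
    print("Cython extensions missing: install a C compiler and reinstall gensim")
else:
    print("Fast Cython path enabled, FAST_VERSION =", word2vec.FAST_VERSION)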

@mal added the backlog label Sep 10, 2015
@jesuisnicolasdavid

Hi everyone, I have the same issue with the memory error. I am trying to increase min_count to get rid of the error, but nothing is working. Any thoughts? Is there a way to reduce the dimensionality from 1000 to maybe 300?

from gensim.models import Word2Vec
word2 = Word2Vec(min_count=100)
model = word2.load("/home/dev/work_devbox1/en_1000_no_stem/en.model")

@phdowling
Contributor

@jesuisnicolasdavid if that is literally the code you are running, then changing min_count will probably not help you. You're calling the load method - this doesn't train a new model, it simply loads an existing one. My guess is that the existing model simply doesn't fit into RAM.

The min_count parameter applies when you're training a new model; more specifically, it filters out words that don't occur frequently enough.

How big is the file you're trying to load, and how much RAM does your machine have?
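To make the distinction concrete, here is a tiny sketch (the toy corpus and file name are placeholders, using the gensim API from around this time):

from gensim.models import Word2Vec

# Toy corpus just to keep the example self-contained.
sentences = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]]

# Training a NEW model: this is where min_count applies
# (min_count=1 here so the toy corpus is not filtered away entirely).
new_model = Word2Vec(sentences, size=100, min_count=1, workers=2)
new_model.save("toy.model")

# Loading an EXISTING model: min_count has no effect here; the whole trained
# model (vectors plus vocabulary) is read back into RAM exactly as it was saved.
existing_model = Word2Vec.load("toy.model")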

@dav009 changed the title from Memory problem in loading wiki2vec in gensim to Memory problem in building wiki2vec in gensim Mar 4, 2016
@dav009 changed the title from Memory problem in building wiki2vec in gensim to Memory problem in building wiki2vec model via gensim Mar 4, 2016
@jesuisnicolasdavid

So the file is 9 GB. I first tried to run the model on a computer with a Titan X and 16 GB of RAM: the model allocates all the RAM and hits a memory error before even getting to the GPU. Then I tried the same code on a second computer with two GTX 980s and 64 GB of RAM: the wiki2vec model takes 20 GB on its own. Then I run into a GPU memory error with Theano through Keras, which says:

('Error allocating 4604368000 bytes of device memory (out of memory).', "you might consider using 'theano.shared(..., borrow=True)'")

But I think I will move this question to a Theano issue :)

@dav009
Contributor

dav009 commented Mar 4, 2016

Is this the model provided in the torrent? I've loaded it successfully on a 16 GB machine.
If you are running into trouble, you can try loading the model in a simple Python script and then exporting the vectors to a plain file; that might be more flexible to work with, without loading the whole thing every time (see the sketch below).
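Roughly something like this (the output file name is arbitrary; on newer gensim versions the export method lives on model.wv instead):

from gensim.models import Word2Vec

# Load the full model once, on a machine with enough RAM...
model = Word2Vec.load("/home/dev/work_devbox1/en_1000_no_stem/en.model")

# ...then dump only the vectors to a plain word2vec-format text file.
# (On newer gensim versions this call is model.wv.save_word2vec_format.)
model.save_word2vec_format("en_vectors.txt", binary=False)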

@jesuisnicolasdavid

Is there a way to turn the 1000-dimensional pre-trained vectors into 300-dimensional ones?

@dav009
Contributor

dav009 commented Mar 9, 2016

Not that I'm aware of. You can always generate 300-dimensional vectors yourself; it should only take some hours (a rough sketch follows below).
For languages other than English there are models with 200/300 dimensions.
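A rough sketch of doing that with gensim's WikiCorpus directly, rather than the full wiki2vec pipeline (the dump file name and all parameters are illustrative):

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Extract plain text from a Wikipedia dump (file name is a placeholder).
# Passing dictionary={} skips building a dictionary we don't need here.
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", dictionary={})
with open("wiki_texts.txt", "w") as out:
    for tokens in wiki.get_texts():
        # Older gensim versions yield bytes tokens, newer ones yield str.
        words = [t.decode("utf-8") if isinstance(t, bytes) else t for t in tokens]
        out.write(" ".join(words) + "\n")

# Train 300-dimensional vectors on the extracted text
# ("size" is "vector_size" in gensim 4+).
model = Word2Vec(LineSentence("wiki_texts.txt"), size=300, window=10,
                 min_count=10, workers=4)
model.save("en_300.model")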

@phdowling
Contributor

Yeah, I don't think there's an easy way to soundly change the dimensionality of the vectors. You might be able to lower the RAM requirements by actually throwing away part of the vocabulary, i.e. loading fewer vectors, but this might also be quite hard if you're dealing with a raw numpy file and have no machine that can actually load it (one possible workaround is sketched below).
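One possible workaround, assuming the vectors are in (or have first been exported to) word2vec format: newer gensim releases accept a limit argument that reads only the first N vectors, which caps RAM use without touching the original file.

from gensim.models import Word2Vec

# Load only the first 500,000 vectors from a word2vec-format file.
# The path is a placeholder; `limit` exists in newer gensim releases
# (and on the newest ones this lives on KeyedVectors.load_word2vec_format).
model = Word2Vec.load_word2vec_format("en_vectors.txt", binary=False, limit=500000)

Since word2vec-format files are typically sorted by descending frequency, this keeps the most common words.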

@jesuisnicolasdavid

Thanks guys, I will try to generate a 300-dimensional model on my own. I'm still wondering in what cases 1000 dimensions can be useful?

@jsgriffin added and removed the monster label Apr 8, 2016
@vondiplo

@jesuisnicolasdavid have you been successful in creating a 300-dimensional model?

@dav009
Contributor

dav009 commented Jun 20, 2016

This is probably solved in the newest gensim version.
I'm going to check that out and bump the version that gets installed.

@vondiplo worth giving that a try ^

@Lugrin added the icebox label and removed the backlog label Apr 10, 2017
@mal removed the fandango label Jan 10, 2018