
Memory problem in building wiki2vec model via gensim #7

Open
nooralahzadeh opened this issue Aug 11, 2015 · 13 comments
@nooralahzadeh

Hi,
Did you have memory problems loading the trained wiki2vec model in gensim?
I trained with size=500, window=10, min_count=10 on the latest English Wikipedia dump, which produced a 13 GB wiki2vec model. When I try to load it in gensim I get a MemoryError.
Do you have any idea how much memory I need?

@dav009
Contributor

dav009 commented Aug 11, 2015

Yeah, this is due to the vocabulary size.
I think there has been some work on this in gensim's word2vec implementation since I last looked.

If you are only interested in getting the entity vectors, then @phdowling has a gensim branch for that, which applies the min_count filter to anything that is not an entity vector.

Otherwise, reduce your vocabulary by either:

  • passing a higher min_count (see the sketch after this list), or
  • cleaning up the noise: the lib that cleans the wikitext generates some garbage tokens.
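Something like this for the first option, just a rough sketch (corpus path and numbers are placeholders, not the wiki2vec pipeline defaults, and it targets the gensim API of this era where the dimensionality parameter is still called size):

# Rough sketch: retrain with a higher min_count to shrink the vocabulary.
# "enwiki_processed.txt" is a placeholder for a pre-tokenized corpus,
# one sentence per line; all numbers here are illustrative.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("enwiki_processed.txt")
model = Word2Vec(
    sentences,
    size=500,      # vector dimensionality ("vector_size" in gensim 4+)
    window=10,
    min_count=50,  # raise this to drop rare, often garbage, tokens
    workers=4,
)
model.save("wiki2vec_smaller.model")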

@nooralahzadeh
Author

Exactly, I want to have just the entity vectors. What do I have to do?
Thanks

@dav009
Contributor

dav009 commented Aug 11, 2015

So I think the best you can do at the moment is to use this gensim fork (the develop branch): https://github.com/piskvorky/gensim/ . That fork contains some changes which will help you deal with the vocab size.

One thing: depending on your current setup (Linux or OS X), you might want to pay attention to how gensim is compiled with Cython, so that when gensim runs it makes use of all your cores.

Give it a go and let us know if it goes alright.
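A quick way to check the Cython part (not from this repo, just a standard gensim idiom): FAST_VERSION is -1 when the slow pure-numpy code path is in use.

from gensim.models import word2vec

# FAST_VERSION is -1 if the Cython extensions did not compile; in that case
# training falls back to the slow numpy path and will not use all cores well.
if word2vec.FAST_VERSION == -1:
    print("Cython extensions missing: install a C compiler and reinstall gensim")
else:
    print("Fast Cython path enabled, FAST_VERSION =", word2vec.FAST_VERSION)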

@mal added the backlog label Sep 10, 2015
@jesuisnicolasdavid

Hi everyone, I have the same issue with the memory error. I am trying to increase min_count to get rid of the error, but nothing is working. Any thoughts? Is there a way to reduce the dimensionality from 1000 to maybe 300?

from gensim.models import Word2Vec
word2 = Word2Vec(min_count=100)
model = word2.load("/home/dev/work_devbox1/en_1000_no_stem/en.model")

@phdowling
Contributor

@jesuisnicolasdavid if that is literally the code you are running, then changing min_count will probably not help you. You're calling the load method - this doesn't train a new model, it simply loads an existing one. My guess is that the existing model simply doesn't fit into RAM.

The min_count parameter applies when you're training a new model; more specifically, it filters out words that don't occur frequently enough.

How big is the file you're trying to load, and how much RAM does your machine have?
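To make the distinction concrete, here is a tiny sketch (the toy corpus and file name are placeholders, using the gensim API from around this time):

from gensim.models import Word2Vec

# Toy corpus just to keep the example self-contained.
sentences = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]]

# Training a NEW model: this is where min_count applies
# (min_count=1 here so the toy corpus is not filtered away entirely).
new_model = Word2Vec(sentences, size=100, min_count=1, workers=2)
new_model.save("toy.model")

# Loading an EXISTING model: min_count has no effect here; the whole trained
# model (vectors plus vocabulary) is read back into RAM exactly as it was saved.
existing_model = Word2Vec.load("toy.model")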

@dav009 changed the title from Memory problem in loading wiki2vec in gensim to Memory problem in building wiki2vec in gensim Mar 4, 2016
@dav009 changed the title from Memory problem in building wiki2vec in gensim to Memory problem in building wiki2vec model via gensim Mar 4, 2016
@jesuisnicolasdavid

So the file is 9 GB. I first tried to run the model on a computer with a Titan X and 16 GB of RAM: the model allocates all the RAM and hits a memory error before even getting to the GPU. Then I tried the same code on a second computer with two GTX 980s and 64 GB of RAM: the wiki2vec model takes 20 GB on its own. Then I run into a GPU memory error with Theano through Keras, which says:

('Error allocating 4604368000 bytes of device memory (out of memory).', "you might consider using 'theano.shared(..., borrow=True)'")

But I think I will move this question to a Theano issue :)

@dav009
Contributor

dav009 commented Mar 4, 2016

Is this the model provided in the torrent? I've loaded it successfully on a 16 GB machine.
If you are running into trouble, you can try loading the model in a simple Python script and then exporting the vectors to a plain file; that might be more flexible to work with, without loading the whole thing every time (see the sketch below).
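Roughly something like this (the output file name is arbitrary; on newer gensim versions the export method lives on model.wv instead):

from gensim.models import Word2Vec

# Load the full model once, on a machine with enough RAM...
model = Word2Vec.load("/home/dev/work_devbox1/en_1000_no_stem/en.model")

# ...then dump only the vectors to a plain word2vec-format text file.
# (On newer gensim versions this call is model.wv.save_word2vec_format.)
model.save_word2vec_format("en_vectors.txt", binary=False)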

@jesuisnicolasdavid

Is there a way to turn the 1000-dimensional pre-trained vectors into 300-dimensional ones?

@dav009
Contributor

dav009 commented Mar 9, 2016

Not that I'm aware of. You can always generate 300-dimensional vectors yourself; it should only take some hours (a rough sketch follows below).
For languages other than English there are models with 200/300 dimensions.
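A rough sketch of doing that with gensim's WikiCorpus directly, rather than the full wiki2vec pipeline (the dump file name and all parameters are illustrative):

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Extract plain text from a Wikipedia dump (file name is a placeholder).
# Passing dictionary={} skips building a dictionary we don't need here.
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", dictionary={})
with open("wiki_texts.txt", "w") as out:
    for tokens in wiki.get_texts():
        # Older gensim versions yield bytes tokens, newer ones yield str.
        words = [t.decode("utf-8") if isinstance(t, bytes) else t for t in tokens]
        out.write(" ".join(words) + "\n")

# Train 300-dimensional vectors on the extracted text
# ("size" is "vector_size" in gensim 4+).
model = Word2Vec(LineSentence("wiki_texts.txt"), size=300, window=10,
                 min_count=10, workers=4)
model.save("en_300.model")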

@phdowling
Contributor

Yeah, I don't think there's an easy way to soundly change the dimensionality of the vectors. You might be able to lower the RAM requirements by actually throwing away part of the vocabulary, i.e. loading fewer vectors, but this might also be quite hard if you're dealing with a raw numpy file and have no machine that can actually load it (one possible workaround is sketched below).
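One possible workaround, assuming the vectors are in (or have first been exported to) word2vec format: newer gensim releases accept a limit argument that reads only the first N vectors, which caps RAM use without touching the original file.

from gensim.models import Word2Vec

# Load only the first 500,000 vectors from a word2vec-format file.
# The path is a placeholder; `limit` exists in newer gensim releases
# (and on the newest ones this lives on KeyedVectors.load_word2vec_format).
model = Word2Vec.load_word2vec_format("en_vectors.txt", binary=False, limit=500000)

Since word2vec-format files are typically sorted by descending frequency, this keeps the most common words.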

@jesuisnicolasdavid

Thanks guys, I will try to generate a 300-dimensional model on my own. I'm still wondering in what cases 1000 dimensions can be useful?

@jsgriffin added and removed the monster label Apr 8, 2016
@vondiplo

@jesuisnicolasdavid have you been successful in creating a 300-dimensional model?

@dav009
Contributor

dav009 commented Jun 20, 2016

This is probably solved in the newest gensim version.
I'm going to check that out and bump the version that gets installed.

@vondiplo worth giving that a try ^

@Lugrin added the icebox label and removed the backlog label Apr 10, 2017
@mal removed the fandango label Jan 10, 2018