This repository has been archived by the owner on Feb 22, 2021. It is now read-only.

Problems of generating Corpus file #23

Open
zhq2009 opened this issue Jul 27, 2016 · 9 comments

@zhq2009

zhq2009 commented Jul 27, 2016

Hello,

We are using prepare.sh to generate the corpus file, but the file it produces is empty. Could you please suggest how to solve this problem?

Thank you very much

@dav009
Contributor

dav009 commented Jul 29, 2016

What language are you trying?
Can you paste the command you are running?

@zhq2009
Author

zhq2009 commented Aug 2, 2016

Hello,

We are trying the English Wikipedia.
The command we are running is "sudo sh prepare.sh en_US /mnt/data/". As far as we can tell, prepare.sh does everything itself, including downloading files and compiling programs.
We are wondering whether we could get the executable programs directly, because we are also running into compatibility problems, and the generated corpus file is empty.

Thank you very much

@zhq2009
Author

zhq2009 commented Aug 3, 2016

Hello,

We ran the commands in prepare.sh manually and generated the corpus file successfully. We are now training a model with the corpus file. This is the output we got from the command:

...
Requirement already satisfied (use --upgrade to upgrade): requests in /usr/lib/python2.7/dist-packages (from smart-open>=1.2.1->gensim)
Cleaning up...
pid 13182's current affinity mask: ff
pid 13182's new affinity mask: ff

The program has stayed at this point for several hours, with CPU usage at full load.

We are wondering whether the program is running correctly, and whether we should simply wait until we get the results.

Thank you very much

@dav009
Contributor

dav009 commented Aug 3, 2016

ZH, depending on the corpus size, the number of dimensions and the method (skipgram, cbow), it can take a long time. For the settings of the shared models it usually took around 4,5 hours.
My advice is to let it run for a few hours (at least 6).

Be aware that if you installed gensim manually, it might not be using all the cores.
The script provided in this repo installs it such that it uses as many cores as possible.

The first stage of word2vec (gathering the vocabulary) will only use a single core, though. The batches of matrix factorization are then done in parallel, using as many cores as possible.
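
The "affinity mask" lines in the log above look like the output of `taskset -p`, which prints the mask in hexadecimal, one bit per CPU the process may run on. As a small sketch (written for Python 3; the helper name `cpus_in_affinity_mask` is hypothetical and not part of this repo), the mask can be decoded like this to see how many cores the training process is allowed to use:

```python
def cpus_in_affinity_mask(mask_hex):
    """Return the CPU indices enabled in a hexadecimal affinity mask."""
    mask = int(mask_hex, 16)
    # Bit i set means the process is allowed to run on CPU i.
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# "ff" is the mask reported in the log above.
cpus = cpus_in_affinity_mask("ff")
print(len(cpus), "CPUs allowed:", cpus)
```

A mask of `ff` therefore permits CPUs 0 through 7. This only shows what the process is allowed to use; whether gensim actually runs in parallel still depends on how it was installed, as noted above.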

@zhq2009
Author

zhq2009 commented Aug 9, 2016

Hello,

We used the command "wiki2vec.sh corpus output/model.w2c 50 500 10" to generate the model file. After the program had run for 20 hours, we got the error message "IOError: [Errno 2] No such file or directory: '/home/_/_/wiki2vec/wiki2vec-master/results/model.w2c.syn1neg.npy'".

Could you please give us some suggestions about how to solve the problem?

Thank you very much.
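
The cause of this error is not established in this thread. One possibility, stated here purely as an assumption, is that the directory named in the error message does not exist or is not writable at the moment the model and its companion `.npy` files are saved. A cheap precaution before starting a run that takes many hours is to create and check the output directory up front. This is a minimal sketch for Python 3 (the thread itself uses Python 2.7), and the helper name `ensure_output_dir` is hypothetical:

```python
import os


def ensure_output_dir(model_path):
    """Create the parent directory of model_path if needed and return it.

    Raises PermissionError if the directory exists but is not writable.
    """
    out_dir = os.path.dirname(os.path.abspath(model_path))
    os.makedirs(out_dir, exist_ok=True)  # no error if it already exists
    if not os.access(out_dir, os.W_OK):
        raise PermissionError("cannot write to " + out_dir)
    return out_dir


# Example (hypothetical path): call this before launching training.
# ensure_output_dir("results/model.w2c")
```

Running the check first means a missing or read-only directory is reported within seconds rather than after the training has finished.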

@Lugrin Lugrin added the backlog label Oct 21, 2016
@RishabGargeya

Hi @zhq2009, was this issue ever resolved?

@zhq2009
Author

zhq2009 commented Jan 3, 2017 via email

@matthewdparker

Hi, I'm having the same problem when I try to generate the Corpus file - the file keeps coming up empty. I'm running the following command:

sudo sh prepare.sh en_US ~/data

Do you know why this might be?

Thank you!
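
Since an empty corpus file is the recurring symptom in this thread, it can help to confirm it before starting any training. The following is only a sketch for Python 3, with a hypothetical helper name `corpus_summary`; it reports the file size and the first few lines without loading the whole file:

```python
import os


def corpus_summary(path, sample_lines=3):
    """Return (size_in_bytes, first_lines) for a text corpus file."""
    size = os.path.getsize(path)
    sample = []
    with open(path, encoding="utf-8", errors="replace") as f:
        # zip stops after sample_lines, so large files are not read fully.
        for _, line in zip(range(sample_lines), f):
            sample.append(line.rstrip("\n"))
    return size, sample


# Example (hypothetical path):
# size, sample = corpus_summary("/mnt/data/corpus")
# print(size, sample)
```

A size of 0 confirms the problem described here. In that case, running the commands in prepare.sh one at a time, as reported earlier in this thread, shows which step fails.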

@Lugrin Lugrin added icebox and removed backlog labels Apr 10, 2017
@Aditi138

Hi, I am also facing the same issue.

When I ran the following snippet:

from gensim.models import Word2Vec
model = Word2Vec.load("path/to/word2vec/en.model")
model.similarity('woman', 'man')

I got the following error:

 array.shape = shape
ValueError: cannot reshape array of size 108 into shape (1151090,1000)

Next, when I run "sudo sh prepare.sh en_US ~/data", the corpus file is empty.
Could the two be related? If not, how can I solve these two issues?
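
The reshape error says that the array file declares a shape of (1151090, 1000) but only 108 values could be read from it, which is what an incomplete or truncated `.npy` file would produce. A rough way to test that is to compare the file size on disk with the size the declared shape requires. The sketch below is for Python 3 and assumes 4-byte float32 values, which is an assumption about how the model was saved; the helper names are hypothetical:

```python
import os


def expected_payload_bytes(rows, cols, itemsize=4):
    """Bytes needed for a rows x cols array; itemsize=4 assumes float32."""
    return rows * cols * itemsize


def looks_truncated(path, rows, cols, itemsize=4):
    """True if the file is smaller than the array data alone would be."""
    return os.path.getsize(path) < expected_payload_bytes(rows, cols, itemsize)


# Shape taken from the error message above.
print(expected_payload_bytes(1151090, 1000), "bytes expected")
```

Under that assumption the array alone needs about 4.6 GB, so a companion `.npy` file far smaller than that would point to a failed or partial download rather than to a problem in the snippet itself.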

@mal mal removed the fandango label Jan 10, 2018

8 participants