
Training Issue #10

Open
visualizeMath opened this issue Apr 25, 2016 · 3 comments

@visualizeMath

Hi. First of all, thank you very much for your help; you have saved my life at least several times :) I have experienced some problems while training word2vec with a large data corpus. The data I'd like to use for training is almost 4 GB, and I wonder whether that's possible. I also tried training word2vec with 2 GB of data and it didn't work either. Should I increase the heap size or something like that?

@eabdullin
Owner

Can you share your training data? I'll try to train vectors :)

@CaCTuCaTu4ECKuu
Copy link

CaCTuCaTu4ECKuu commented Jun 30, 2016

I found out where this issue (and #1) comes from.
I used about 100 MB of internet data and was surprised to get an exception. Then I realized that when I call StreamReader.ReadLine(), it reads the whole file, because the file is stored with only spaces and no line breaks, and that is what causes the exception. I'm not sure what to do to keep the same performance, because the training uses multiple threads that seek into the file, and you can't seek within a single line.
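
For reference, a minimal sketch of the failure mode described above: StreamReader.ReadLine() reads until the next newline, so on a corpus stored as one multi-gigabyte line it tries to materialize the entire file as a single string and can throw OutOfMemoryException. The file name here is hypothetical.

```csharp
using System;
using System.IO;

class ReadLineRepro
{
    static void Main()
    {
        // Hypothetical path; the corpus is assumed to be one huge line of
        // space-separated words with no line breaks.
        const string corpusPath = "corpus.txt";

        using var reader = new StreamReader(corpusPath);

        // ReadLine() reads until the next '\n' (or end of file). With no
        // newlines in the file, this builds the whole corpus as one string,
        // which can fail with OutOfMemoryException on a multi-GB file.
        string line = reader.ReadLine();
        Console.WriteLine($"Read one 'line' of {line?.Length ?? 0} characters.");
    }
}
```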

@CaCTuCaTu4ECKuu

I solved this by preprocessing the training file and splitting it so that each line contains a limited number of words, because a single solid line causes problems even when opening the file with Notepad++, whereas the processed files open instantly.
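
A minimal sketch of that preprocessing step, assuming space-separated text; the file names and the choice of 1000 words per line are hypothetical, not values from this thread.

```csharp
using System;
using System.IO;
using System.Text;

class CorpusPreprocessor
{
    // Rewrites a single-line, space-separated corpus into lines of
    // `wordsPerLine` words so that StreamReader.ReadLine() never has to
    // materialize the whole file as one string.
    static void SplitIntoLines(string inputPath, string outputPath, int wordsPerLine = 1000)
    {
        using var reader = new StreamReader(inputPath);
        using var writer = new StreamWriter(outputPath);

        var buffer = new char[1 << 16];
        var word = new StringBuilder();
        int wordsOnLine = 0;
        int read;

        // Read the corpus in fixed-size character chunks instead of lines.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (char.IsWhiteSpace(c))
                {
                    if (word.Length > 0)
                    {
                        writer.Write(word);
                        word.Clear();
                        wordsOnLine++;
                        // Start a new line after a fixed number of words.
                        writer.Write(wordsOnLine >= wordsPerLine ? '\n' : ' ');
                        if (wordsOnLine >= wordsPerLine) wordsOnLine = 0;
                    }
                }
                else
                {
                    word.Append(c);
                }
            }
        }

        // Flush the last word, if any, and end the final line.
        if (word.Length > 0) writer.Write(word);
        writer.WriteLine();
    }

    static void Main()
    {
        SplitIntoLines("corpus_single_line.txt", "corpus_lines.txt");
    }
}
```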
