Word2Vec

TensorFlow implementation of Word2Vec, a classic model for learning distributed word representations from large unlabeled datasets.

Training

  1. Prepare your data: it should consist of one or more text files where each line contains a sentence and words are delimited by spaces.
  2. This implementation lets you train the model with either the skip-gram or the continuous bag-of-words architecture (--arch), and perform training using negative sampling or hierarchical softmax (--algm). To see the full list of parameters, run python run_training.py --help.
  3. For example, you can train your model with the following command:
  python run_training.py --filenames=input/wiki1.txt,input/wiki2.txt --out_dir=output/ --window_size=5 --embed_size=300 --arch=skip_gram --algm=negative_sampling --batch_size=256
  4. The vocabulary words and word embeddings will be saved to vocab.txt and embed.npy in the folder specified by --out_dir (embed.npy can be loaded with np.load), as in the sketch after this list.
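
As a rough illustration of how the training outputs can be reused, here is a minimal sketch that loads embed.npy and vocab.txt from the --out_dir folder. The assumption that vocab.txt lists one word per line, aligned row-by-row with embed.npy, is ours and may differ from the actual file layout.

```python
import numpy as np

# Load the embedding matrix saved by run_training.py
# (assumed shape: [vocab_size, embed_size]).
embeddings = np.load("output/embed.npy")

# Load the vocabulary; assumed to contain one word per line, aligned with
# the rows of embed.npy (only the first token is kept, in case the file
# also stores word counts).
with open("output/vocab.txt", encoding="utf-8") as f:
    vocab = [line.split()[0] for line in f if line.strip()]

word_to_index = {word: i for i, word in enumerate(vocab)}

print(embeddings.shape)                    # e.g. (vocab_size, 300)
print(embeddings[word_to_index[vocab[0]]][:5])  # first dimensions of one word vector
```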

Word similarity and analogy evaluation

  1. The package used to load the evaluation datasets uses setuptools. You can install it by running:
  python setup.py install
  2. If you run into problems during this installation, you may first need to install the dependencies:
  pip install -r requirements.txt
  3. To run the similarity evaluation, use the following command:
  python embedding_eval.py -e embedding/embed.npy -v vocabulary/vocab.txt -sv results/ -s
  4. To run the analogy evaluation, use the following command:
  python embedding_eval.py -e embedding/embed.npy -v vocabulary/vocab.txt -sv results/ -a
  5. In both cases, you will find your results in the folder specified by --results.
  6. To display word similarities and analogies graphically, run the following command (a minimal sketch of the underlying idea follows this list); you can further customize this file according to your own needs:
  python show_similarities.py -e embedding/embed.npy -v vocabulary/vocab.txt
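
For intuition about what the similarity and analogy queries compute, here is a minimal sketch of nearest-neighbour lookup by cosine similarity over the saved embeddings. It is an illustration under our own assumptions, not the actual code of embedding_eval.py or show_similarities.py, and the example words are only meaningful if they appear in your vocabulary.

```python
import numpy as np

embeddings = np.load("embedding/embed.npy")
with open("vocabulary/vocab.txt", encoding="utf-8") as f:
    vocab = [line.split()[0] for line in f if line.strip()]
word_to_index = {word: i for i, word in enumerate(vocab)}

# L2-normalize rows so that a dot product equals cosine similarity.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def nearest(vector, k=5, exclude=()):
    """Return the k vocabulary words whose vectors are closest to `vector` (cosine)."""
    scores = normed @ (vector / np.linalg.norm(vector))
    for word in exclude:
        scores[word_to_index[word]] = -np.inf
    return [vocab[i] for i in np.argsort(-scores)[:k]]

# Word similarity: neighbours of a single word (assuming it is in the vocabulary).
print(nearest(embeddings[word_to_index["france"]], exclude=["france"]))

# Word analogy: king - man + woman ≈ queen, the classic Word2Vec example.
a, b, c = (embeddings[word_to_index[w]] for w in ("king", "man", "woman"))
print(nearest(a - b + c, exclude=["king", "man", "woman"]))
```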

Word sense disambiguation and evaluation

  1. Download Stanford's Contextual Word Similarities (SCWS) dataset at http://ai.stanford.edu/~ehhuang/ and unzip it.
  2. Run the word sense disambiguation script with the following command:
  python word_sense_disambiguation.py -e embedding/embed.npy -v vocabulary/vocab.txt -save_path results/ --rating_path SCWS/ratings.txt
  3. If you want to tune more parameters, run python word_sense_disambiguation.py --help to see a list of them.
  4. You will find your results in the folder specified by --save_path (a sketch of how such results can be compared with the SCWS ratings follows this list).
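
word_sense_disambiguation.py handles context-dependent word senses; purely to illustrate how embedding similarities can be checked against the SCWS human judgements, here is a minimal sketch that correlates plain (context-independent) cosine similarities with the ratings. The column layout assumed for ratings.txt and the use of a Spearman correlation are our own assumptions, not the script's actual evaluation.

```python
import numpy as np
from scipy.stats import spearmanr

embeddings = np.load("embedding/embed.npy")
with open("vocabulary/vocab.txt", encoding="utf-8") as f:
    vocab = [line.split()[0] for line in f if line.strip()]
word_to_index = {word: i for i, word in enumerate(vocab)}
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

model_scores, human_scores = [], []
with open("SCWS/ratings.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        # Assumed column layout: id, word1, POS1, word2, POS2,
        # context1, context2, average human rating, individual ratings.
        w1, w2, rating = fields[1].lower(), fields[3].lower(), float(fields[7])
        if w1 in word_to_index and w2 in word_to_index:
            cos = float(normed[word_to_index[w1]] @ normed[word_to_index[w2]])
            model_scores.append(cos)
            human_scores.append(rating)

# Spearman rank correlation between model similarities and human judgements.
print(spearmanr(model_scores, human_scores).correlation)
```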

Results and website

References