
word2vec

Implementation of two word2vec algorithms from scratch: skip-gram (with negative sampling) and CBOW (continuous bag of words).

No machine learning libraries were used: all gradients and cost functions are implemented from scratch. I provide "sanity-check" tests for all of the main functionality.
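For reference, here is a minimal numpy sketch of the negative-sampling loss and its gradients for a single (center, outside) word pair; the function and variable names below are illustrative, not necessarily the ones used in word2vec.py:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss_and_grads(v_c, outside_idx, neg_indices, U):
    # Loss: -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)
    # v_c: center word vector, shape (d,); U: outside-vector matrix, shape (V, d)
    # outside_idx: row of the true outside word; neg_indices: K sampled negative rows
    u_o = U[outside_idx]
    u_neg = U[neg_indices]
    pos = sigmoid(u_o.dot(v_c))               # sigma(u_o . v_c)
    neg = sigmoid(-u_neg.dot(v_c))            # sigma(-u_k . v_c), shape (K,)

    loss = -np.log(pos) - np.sum(np.log(neg))

    grad_center = (pos - 1.0) * u_o + (1.0 - neg).dot(u_neg)
    grad_U = np.zeros_like(U)
    grad_U[outside_idx] = (pos - 1.0) * v_c
    for k, idx in enumerate(neg_indices):
        grad_U[idx] += (1.0 - neg[k]) * v_c   # sigma(u_k . v_c) * v_c
    return loss, grad_center, grad_U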

Important: the entire project and implementation are inspired by the first assignment in Stanford's course "Deep Learning for NLP" (2017). The tasks can be found at this address.

The word vectors are trained on the Stanford Sentiment Treebank (SST), with stochastic gradient descent used for the updates. The full training run (roughly 40,000 iterations) takes about 3 hours on a standard machine (no GPUs). The resulting word vectors can then be used for a (very simple) sentiment analysis task; alternatively, pre-trained vectors can be used. More details about the various parts of the implementation can be found in the assignment description (attached as a pdf, assignment1_description).
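The update rule itself is plain SGD. A minimal sketch of such a training loop follows; the step size and annealing schedule are assumptions for illustration, not necessarily the values used in word2vec.py:

def sgd(grad_fn, x0, step=0.3, iterations=40000, anneal_every=10000):
    # Plain SGD: x <- x - step * grad, with periodic learning-rate halving.
    # x0 is a numpy array of parameters; grad_fn returns (cost, gradient)
    # evaluated on a randomly sampled context window.
    x = x0.copy()
    for it in range(1, iterations + 1):
        cost, grad = grad_fn(x)
        x -= step * grad
        if it % 1000 == 0:
            print("iteration %d, cost %.4f" % (it, cost))
        if it % anneal_every == 0:
            step *= 0.5                # anneal the learning rate
    return x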

List of requirements

How to run

1. To download the datasets:

chmod +x get_datasets.sh
./get_datasets.sh

2. To train some word embeddings:

python word2vec.py

3. To perform sentiment analysis with your own word vectors (both modes build a per-sentence feature from the word vectors; see the sketch after this list):

python sentiment_analysis.py --yourvectors

4. To perform sentiment analysis with pretrained word vectors (GloVe):

python sentiment_analysis.py --pretrained
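Both sentiment-analysis modes need a fixed-size feature per sentence; a simple choice, and the one this family of assignments typically uses, is the average of the sentence's word vectors. A sketch (tokens is assumed to map a word to its row in the vector matrix):

import numpy as np

def sentence_features(sentence, tokens, word_vectors):
    # sentence: list of words; word_vectors: matrix of shape (V, d)
    # Returns the mean of the sentence's word vectors as a (d,) feature.
    feat = np.zeros(word_vectors.shape[1])
    for word in sentence:
        feat += word_vectors[tokens[word]]
    return feat / max(len(sentence), 1)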

Vector space visualisation

[Figure: 2-D visualisation of the learned word-vector space]
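The original plot is not reproduced here, but a 2-D visualisation of this kind can be generated with a rank-2 SVD projection of the vectors, roughly as follows (matplotlib is assumed; tokens maps words to rows of the vector matrix):

import numpy as np
import matplotlib.pyplot as plt

def plot_words(word_vectors, tokens, words):
    # Project the chosen word vectors to 2-D via a rank-2 SVD and label them.
    vecs = word_vectors[[tokens[w] for w in words]]
    vecs = vecs - vecs.mean(axis=0)          # centre before the SVD
    U, S, Vt = np.linalg.svd(vecs, full_matrices=False)
    coords = U[:, :2] * S[:2]                # best rank-2 coordinates
    plt.scatter(coords[:, 0], coords[:, 1])
    for i, w in enumerate(words):
        plt.annotate(w, (coords[i, 0], coords[i, 1]))
    plt.show()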

Licence

All my source code is licensed under the MIT license. Please consider citing the Stanford Sentiment Treebank if you use the dataset. If you use this code for purposes other than educational ones, please acknowledge Stanford's course, which initiated the project and provided many of the core parts of the current implementation.

Resources
