
word2vec

Implementation of two word2vec algorithms from scratch: skip-gram (with negative sampling) and CBOW (continuous bag of words).

No machine learning libraries were used: all gradients and cost functions are implemented from scratch. I provide "sanity-check" tests for all of the main functionality.
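For reference, here is a minimal numpy sketch of the negative-sampling loss and its gradients for a single (center, outside) word pair; the function and variable names below are illustrative, not necessarily the ones used in word2vec.py:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss_and_grads(v_c, outside_idx, neg_indices, U):
    # Loss: -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)
    # v_c: center word vector, shape (d,); U: outside-vector matrix, shape (V, d)
    # outside_idx: row of the true outside word; neg_indices: K sampled negative rows
    u_o = U[outside_idx]
    u_neg = U[neg_indices]
    pos = sigmoid(u_o.dot(v_c))               # sigma(u_o . v_c)
    neg = sigmoid(-u_neg.dot(v_c))            # sigma(-u_k . v_c), shape (K,)

    loss = -np.log(pos) - np.sum(np.log(neg))

    grad_center = (pos - 1.0) * u_o + (1.0 - neg).dot(u_neg)
    grad_U = np.zeros_like(U)
    grad_U[outside_idx] = (pos - 1.0) * v_c
    for k, idx in enumerate(neg_indices):
        grad_U[idx] += (1.0 - neg[k]) * v_c   # sigma(u_k . v_c) * v_c
    return loss, grad_center, grad_U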

Important: the entire project and implementation are inspired by the first assignment in Stanford's course "Deep Learning for NLP" (2017). The tasks can be found at this address.

The word vectors are trained on the Stanford Sentiment Treebank (SST), with stochastic gradient descent used for the updates. The full training run (roughly 40,000 iterations) takes about 3 hours on a standard machine (no GPUs). The resulting word vectors can then be used for a (very simple) sentiment analysis task; alternatively, pre-trained vectors can be used. More details about the various parts of the implementation can be found in the assignment description (attached as a pdf, assignment1_description).
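The update rule itself is plain SGD. A minimal sketch of such a training loop follows; the step size and annealing schedule are assumptions for illustration, not necessarily the values used in word2vec.py:

def sgd(grad_fn, x0, step=0.3, iterations=40000, anneal_every=10000):
    # Plain SGD: x <- x - step * grad, with periodic learning-rate halving.
    # x0 is a numpy array of parameters; grad_fn returns (cost, gradient)
    # evaluated on a randomly sampled context window.
    x = x0.copy()
    for it in range(1, iterations + 1):
        cost, grad = grad_fn(x)
        x -= step * grad
        if it % 1000 == 0:
            print("iteration %d, cost %.4f" % (it, cost))
        if it % anneal_every == 0:
            step *= 0.5                # anneal the learning rate
    return x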

List of requirements

How to run

1. To download the datasets:

chmod +x get_datasets.sh
./get_datasets.sh

2. To train some word embeddings:

python word2vec.py

3. To perform sentiment analysis with your own word vectors (both modes build a per-sentence feature from the word vectors; see the sketch after this list):

python sentiment_analysis.py --yourvectors

4. To perform sentiment analysis with pretrained word vectors (GloVe):

python sentiment_analysis.py --pretrained
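Both sentiment-analysis modes need a fixed-size feature per sentence; a simple choice, and the one this family of assignments typically uses, is the average of the sentence's word vectors. A sketch (tokens is assumed to map a word to its row in the vector matrix):

import numpy as np

def sentence_features(sentence, tokens, word_vectors):
    # sentence: list of words; word_vectors: matrix of shape (V, d)
    # Returns the mean of the sentence's word vectors as a (d,) feature.
    feat = np.zeros(word_vectors.shape[1])
    for word in sentence:
        feat += word_vectors[tokens[word]]
    return feat / max(len(sentence), 1)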

Vector space visualisation

[Figure: 2-D visualisation of the learned word-vector space]
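The original plot is not reproduced here, but a 2-D visualisation of this kind can be generated with a rank-2 SVD projection of the vectors, roughly as follows (matplotlib is assumed; tokens maps words to rows of the vector matrix):

import numpy as np
import matplotlib.pyplot as plt

def plot_words(word_vectors, tokens, words):
    # Project the chosen word vectors to 2-D via a rank-2 SVD and label them.
    vecs = word_vectors[[tokens[w] for w in words]]
    vecs = vecs - vecs.mean(axis=0)          # centre before the SVD
    U, S, Vt = np.linalg.svd(vecs, full_matrices=False)
    coords = U[:, :2] * S[:2]                # best rank-2 coordinates
    plt.scatter(coords[:, 0], coords[:, 1])
    for i, w in enumerate(words):
        plt.annotate(w, (coords[i, 0], coords[i, 1]))
    plt.show()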

Licence

All my source code is licensed under the MIT license. Please consider citing the Stanford Sentiment Treebank if you use the dataset. If you use this code for purposes other than educational ones, please acknowledge Stanford's course, which initiated the project and provided many of the core parts of the current implementation.

Resources
