A repository to explore dense representations of words.
- Most words are symbols for an extra-linguistic entity - a word is a signifier that maps to a signified (idea/thing)
- Approx. 13M words in the English language
- There is probably some N-dimensional space (with N << 13M) that is sufficient to encode all the semantics of our language
- Simplest word vector - one-hot encoding (see the sketch after this list)
- Denotational semantics - representing an idea as a symbol (a word, or a one-hot vector); such localist encodings are sparse and cannot capture similarity between words
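A minimal NumPy sketch of one-hot encoding (the toy vocabulary is made up for illustration). It shows why such vectors cannot capture similarity: any two distinct words have dot product 0.

```python
import numpy as np

# Toy vocabulary (hypothetical), one index per word
vocab = ["king", "queen", "man", "woman"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

# Distinct one-hot vectors are orthogonal, so there is no notion of similarity
print(one_hot("king") @ one_hot("queen"))  # 0.0
print(one_hot("king") @ one_hot("king"))   # 1.0
```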
Evaluation
- Intrinsic - evaluation on a specific, intermediate task
- Fast to compute
- Aids with understanding of the system
- Needs to be correlated with a real task to provide a good measure of usefulness
- Word analogies - popular intrinsic evaluation method for word vectors (a small sketch follows this list)
- Semantic - e.g. King/Man | Queen/Woman
- Syntactic - e.g. big/biggest | fast/fastest
- Extrinsic - evaluation on a real task
- Slow
- May not be clear whether the problem with low performance is related to a particular subsystem, other subsystems, or interactions between subsystems
- If a subsystem is replaced and performance improves, the change is likely to be good
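A small sketch of intrinsic evaluation via word analogies, using plain NumPy and cosine similarity (the embedding dictionary here is random placeholder data, not trained vectors): the analogy a : b :: c : ? is answered by the word whose vector is closest to b - a + c.

```python
import numpy as np

# Placeholder embeddings (random here; in practice, load trained vectors)
rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "big", "biggest"]
embeddings = {w: rng.normal(size=50) for w in words}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a: str, b: str, c: str) -> str:
    """Answer a : b :: c : ? by nearest neighbour to (b - a + c)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Semantic analogy: with good vectors this should return "queen"
print(analogy("man", "king", "woman"))
```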
- https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/
- http://mccormickml.com/
- https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html#examples-word2vec-on-game-of-thrones
- SVD-based
- LSA
- LDA
- word2vec
- GloVe
- Loop over the dataset and accumulate word co-occurrence counts in a matrix $X$
- Perform an SVD on $X$ to get a $USV^T$ decomposition
- Use the rows of $U$ as the word embeddings (use the first $k$ columns to limit the embedding dimension); a minimal sketch of this pipeline follows the variance formula below
Methods to compute $X$:
- Word-Document matrix: each time word $i$ appears in document $j$, increment $X_{ij}$; $X \in \mathcal{R}^{V \times M}$, where $M$ is the number of documents
- Window-based co-occurrence matrix: each time word $i$ appears within a fixed-size window around word $j$, increment $X_{ij}$; $X \in \mathcal{R}^{V \times V}$
Variance captured by the first $k$ dimensions: $\frac{\sum_{i=1}^{k}\sigma_i}{\sum_{i=1}^{|V|}\sigma_i}$
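A minimal NumPy sketch of the window-based co-occurrence + truncated SVD pipeline described above (the corpus, window size, and $k$ are placeholder choices):

```python
import numpy as np

# Placeholder corpus and hyperparameters
corpus = [["the", "king", "rules", "the", "land"],
          ["the", "queen", "rules", "the", "land"]]
window, k = 2, 3

# Build the vocabulary and the window-based co-occurrence matrix X (V x V)
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                X[idx[w], idx[sent[j]]] += 1.0

# SVD: X = U S V^T; keep the first k columns of U as the embeddings
U, S, Vt = np.linalg.svd(X)
embeddings = U[:, :k]

# Fraction of variance captured by the first k singular values
print(S[:k].sum() / S.sum())
```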
Problems:
- Vocab fixed at start - based on corpus
- Matrix is sparse
- Matrix is high-dimensional (quadratic cost for SVD)
- Requires ad hoc adjustments ('hacks') to compensate for highly imbalanced word frequencies (function words dominate the counts)
Solutions to issues:
- Ignore function words
- Weight co-occurrence counts based on the distance between the words in the document (see the sketch after this list)
- Use Pearson correlation and set negative counts to 0 instead of using just the raw count
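A small sketch of the distance-weighting idea as a drop-in change to the counting loop in the previous sketch (reusing `corpus`, `window`, `X` and `idx`; the 1/distance ramp is one common choice, not the only one):

```python
# Drop-in variant of the counting loop above: weight each co-occurrence by
# 1 / distance so that words appearing closer together contribute more.
X = np.zeros_like(X)  # rebuild X from scratch with weighted counts
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                X[idx[w], idx[sent[j]]] += 1.0 / abs(i - j)
```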
- 2013, Mikolov et al. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
- 2013, Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013.
Two model architectures:
- Continuous Bag-of-Words (CBOW) - uses context words to predict target word
- Continuous Skip-gram - uses the target (center) word to predict the surrounding context words
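A small sketch of how the two architectures consume a sentence (the window size and example sentence are placeholders): CBOW forms (context words → center word) examples, skip-gram forms (center word → context word) pairs.

```python
def training_examples(sentence, window=2):
    """Yield one CBOW example and the corresponding skip-gram pairs per position."""
    for i, center in enumerate(sentence):
        context = [sentence[j]
                   for j in range(max(0, i - window), min(len(sentence), i + window + 1))
                   if j != i]
        cbow = (context, center)                   # context words -> center word
        skipgram = [(center, c) for c in context]  # center word -> each context word
        yield cbow, skipgram

for cbow, sg in training_examples(["the", "queen", "rules", "the", "land"]):
    print(cbow, sg)
```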
Training methods
- Negative sampling - include negative examples in computation of cost function
- Better for frequent words and lower dimensional vectors
- Hierarchical softmax - define objective using an efficient tree structure to compute probabilities for the complete vocabulary
- Better for infrequent words
Objective function (cross-entropy)
- With a one-hot target, the cross-entropy loss reduces to the negative log probability of the correct (context) word
- Skip-gram therefore maximises the average log probability of the context words given each center word:
$$\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \ne 0}\log p(w_{t+j} \mid w_t)$$
- Assumption: given the center word, all output words are completely independent (Naive Bayes assumption), so the probability of the whole context factorises into a product of per-word terms
- Use the negative log-likelihood for better scaling (a sum of log terms rather than a product of probabilities)
- $c$ is the size of the training context (which can be a function of the center word $w$)
- Larger $c$ - more training examples, higher accuracy, increased training time
The probability $p(w_O \mid w_I)$ is defined with the softmax:
$$p(w_O \mid w_I) = \frac{\exp(v_{w_O}^{\top} u_{w_I})}{\sum_{w=1}^{W}\exp(v_{w}^{\top} u_{w_I})}$$
- $u_w$ and $v_w$ are the 'input' and 'output' vector representations of $w$, and $W$ is the size of the vocabulary
- This formulation is computationally impractical because it requires computing the softmax over all the representations in the vocabulary
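A NumPy sketch of that softmax probability (`U` and `V` are placeholder 'input'/'output' embedding matrices with made-up sizes), which makes the cost visible: the denominator touches every word in the vocabulary.

```python
import numpy as np

W, d = 10_000, 100            # placeholder vocabulary size and embedding dimension
rng = np.random.default_rng(0)
U = rng.normal(size=(W, d))   # 'input' vectors u_w (one row per word)
V = rng.normal(size=(W, d))   # 'output' vectors v_w

def p_output_given_input(o: int, i: int) -> float:
    """p(w_o | w_i) with the full softmax: O(W) work for a single probability."""
    scores = V @ U[i]          # dot product with every output vector
    scores -= scores.max()     # numerical stability
    return float(np.exp(scores[o]) / np.exp(scores).sum())

print(p_output_given_input(42, 7))
```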
With negative sampling, each $\log p(w_O \mid w_I)$ term is replaced by
$$\log \sigma(v_{w_O}^{\top} u_{w_I}) + \sum_{i=1}^{k}\mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma(-v_{w_i}^{\top} u_{w_I})\right]$$
where $P_n(w)$ is the noise distribution from which the $k$ negative samples are drawn (the unigram distribution raised to the $3/4$ power in Mikolov et al.)
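A NumPy sketch of the loss for a single training pair under negative sampling (reusing `U`, `V`, `W` and `rng` from the previous sketch; the word indices and `k = 5` negatives are placeholders): only $k + 1$ dot products are needed instead of $W$.

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center: int, context: int, negatives: list[int]) -> float:
    """Negative log of sigma(v_o . u_c) for the true pair plus sigma(-v_n . u_c) per negative."""
    loss = -np.log(sigmoid(V[context] @ U[center]))
    for n in negatives:
        loss -= np.log(sigmoid(-V[n] @ U[center]))
    return float(loss)

# k = 5 negatives, drawn uniformly here for simplicity
# (the paper draws them from the unigram distribution raised to the 3/4 power)
negs = rng.integers(0, W, size=5).tolist()
print(neg_sampling_loss(7, 42, negs))
```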
- Combines the count-based global statistics of LSA-style methods with the prediction-based approach of word2vec
- Predict the probability of word $j$ occurring in the context of word $i$ using global statistics (e.g. co-occurrence counts)
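For reference (not spelled out in the notes above, but the standard formulation from Pennington et al.), the GloVe objective is a weighted least-squares fit of word-vector dot products to the log co-occurrence counts, written here with $u_j$ as the context ('output') vector and $v_i$ as the center ('input') vector:

$$J = \sum_{i,j=1}^{W} f(X_{ij})\left(u_j^{\top} v_i + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

where $f(X_{ij})$ is a weighting function that stops very frequent co-occurrences from dominating, and $b_i$, $\tilde{b}_j$ are per-word bias terms.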