- Cross Entropy: The best model for cross entropy loss used an embedding size of 64 and was trained from scratch for 4 million steps, without loading the pre-trained vectors.
- NCE: The best model for NCE loss was obtained by increasing the window size, with `skip_window` set to 8 and `num_skips` set to 16. Training continued from the pre-trained model for 200000 steps, with all other parameters unchanged.

Batch generation relies on the global variable `data_index`, which tracks the last word used for batch generation and always points to the center word of the current window. It starts at position `skip_window`, because a center word needs `skip_window` words on its left, and ends at `len(data) - skip_window`, because it needs `skip_window` words on its right. When it reaches the end of the data, it is reset to the start position.

For every center word that `data_index` points to, `num_skips` (center, context) pairs are generated by stepping `skip_window` times to the left and to the right of `data_index`. Once the window is complete, `data_index` is incremented by 1 so that the next word becomes the center word, and the loop continues until `batch_size` pairs have been collected.
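
The batching scheme described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual function: it assumes `num_skips == 2 * skip_window` (true for the settings reported in this README), treats `len(data) - skip_window` as an exclusive upper bound, and passes `data_index` in and out instead of using a global so that it is easy to test. The function name and signature are hypothetical.

```python
import numpy as np

def generate_batch(data, data_index, batch_size, num_skips, skip_window):
    """Return (centers, contexts, next_data_index) for one batch.

    Hypothetical signature: the original code keeps data_index as a global.
    """
    # Assumption: every word in the window is used exactly once.
    assert num_skips == 2 * skip_window
    assert batch_size % num_skips == 0
    centers = np.empty(batch_size, dtype=np.int32)
    contexts = np.empty(batch_size, dtype=np.int32)
    filled = 0
    while filled < batch_size:
        # Step skip_window times away from the center, left then right.
        for offset in range(1, skip_window + 1):
            for neighbor in (data_index - offset, data_index + offset):
                centers[filled] = data[data_index]
                contexts[filled] = data[neighbor]
                filled += 1
        # Move to the next center word, wrapping back to the first valid one.
        data_index += 1
        if data_index >= len(data) - skip_window:
            data_index = skip_window
    return centers, contexts, data_index
```

For example, calling `generate_batch(list(range(10)), 2, 8, 4, 2)` yields four pairs for the word at position 2 followed by four pairs for the word at position 3.
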

The cross entropy loss is computed from two terms, A and B, where A is the numerator of the log-likelihood function and B is the denominator.

For A, we need the dot product of each center word with its target word, for all words in the batch. This can be done by multiplying the `inputs` matrix with the `true_w` matrix (transposed so that the shapes align) and keeping only the diagonal of the result. Formally, A is the log of the exp of this value, but the two operations cancel each other out, so the diagonal can be used directly.

For B, we need the dot product of the center word with every target word in the vocabulary; for simplicity, we only use the target words in the current batch. We can reuse the matrix product computed for A, take the exp of each entry, sum over the target words with reduce sum, and take the log to get B.

The function returns -(A - B).
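
As a rough sketch of the computation above, here is a NumPy version (the project itself appears to use TensorFlow ops such as reduce sum, so this is an illustration, not the original code). It assumes `inputs` and `true_w` both have shape `[batch_size, embedding_size]`, sums B over every target word in the batch including the true one, and subtracts the row maximum before exponentiating for numerical stability, which is an addition not described in the text.

```python
import numpy as np

def cross_entropy_loss(inputs, true_w):
    """Per-example loss -(A - B) using in-batch targets as the softmax support.

    inputs: [batch, dim] center-word embeddings (assumed shape)
    true_w: [batch, dim] target-word embeddings (assumed shape)
    """
    # scores[i, j] = dot(center_i, target_j)
    scores = inputs @ true_w.T
    # A: log(exp(dot(center_i, target_i))) simplifies to the diagonal.
    A = np.diag(scores)
    # B: log of the summed exps over all targets in the batch,
    # computed with the max subtracted to avoid overflow.
    row_max = scores.max(axis=1, keepdims=True)
    B = row_max[:, 0] + np.log(np.sum(np.exp(scores - row_max), axis=1))
    return -(A - B)
```

Because the sum in B includes the true target, each returned value is non-negative.
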

The calculation of NCE loss can be divided into two parts: one involving the center word and its target word, and the other involving the center word and the negative words.

For the first part, use the same matrix multiplication technique as in cross entropy to calculate the dot product of each center word with its target word, and add the bias. Then subtract the log of k times the unigram probability of the target word, where k is the number of negative samples, and take the sigmoid. The result is the probability that the (center, target) pair comes from the corpus rather than from the noise distribution.

For the second part, calculate the same probability for each (center, negative) pair and subtract it from 1 to get the probability that the pair does not come from the corpus. To do this for every negative word in the current negative sample, use matrix multiplication, take the log of each term, and use reduce sum for the summation. Finally, add the log-probability of the (center, target) pair to the summed log-probabilities of the negative pairs and return the negative of the result, so that gradient descent can minimize it.
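
The two parts of the NCE computation might look roughly like the NumPy sketch below. It is an illustration under assumptions, not the project's implementation: the argument names and shapes are hypothetical, the same `k` negative words are shared across the whole batch, and a small `eps` is added inside the logs to avoid `log(0)`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(inputs, true_w, true_b, neg_w, neg_b, true_unigram, neg_unigram, k):
    """Per-example NCE loss (all names and shapes are assumptions).

    inputs: [batch, dim], true_w: [batch, dim], true_b: [batch]
    neg_w: [k, dim], neg_b: [k]
    true_unigram: [batch], neg_unigram: [k] unigram probabilities
    """
    eps = 1e-10  # guards against log(0); not part of the formula itself
    # Part 1: center/target score, equal to the diagonal of inputs @ true_w.T.
    true_scores = np.sum(inputs * true_w, axis=1) + true_b
    p_true = sigmoid(true_scores - np.log(k * true_unigram + eps))
    # Part 2: center/negative scores for every negative word, shape [batch, k].
    neg_scores = inputs @ neg_w.T + neg_b
    p_neg = sigmoid(neg_scores - np.log(k * neg_unigram + eps))
    # Log-probability of the true pair plus summed log-probabilities
    # that each negative pair is noise.
    objective = np.log(p_true + eps) + np.sum(np.log(1.0 - p_neg + eps), axis=1)
    return -objective
```

Row-wise multiplication is used for the target scores because it gives the same values as the diagonal of the full matrix product without computing the off-diagonal entries.
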

To find the least and most similar choice for a relation given a set of examples, compute the difference vector between the two words of each example and average these differences. Then compute the difference vector of each choice and pick the choices whose difference vectors are least and most similar to the average vector. Similarity is measured with the cosine similarity of two vectors A and B:
`similarity = (A · B) / (||A|| * ||B||)`
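
A minimal sketch of this selection step is shown below. It assumes the embeddings are available as a dictionary from word to NumPy vector and that examples and choices are given as word pairs; the function names and data layout are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """similarity = (A . B) / (||A|| * ||B||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def least_and_most_similar(embeddings, examples, choices):
    """Return (least_similar_choice, most_similar_choice).

    embeddings: dict mapping word -> vector (assumed layout)
    examples, choices: lists of (word_a, word_b) pairs
    """
    # Average difference vector over the example pairs.
    avg_diff = np.mean([embeddings[a] - embeddings[b] for a, b in examples], axis=0)
    # Cosine similarity of each choice's difference vector to the average.
    sims = [cosine_similarity(embeddings[a] - embeddings[b], avg_diff)
            for a, b in choices]
    return choices[int(np.argmin(sims))], choices[int(np.argmax(sims))]
```

Note that a choice whose two words have identical vectors would produce a zero difference vector and a division by zero; the sketch does not guard against that case.
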

- Embedding Size: Changed `embedding_size` from 128 to 64 and commented out loading of the pre-trained model for both cross entropy loss and NCE loss. Increased `max_num_steps` to 4000000 as it is being trained from scratch. All other parameters are the same.
- Batch Size: Changed `batch_size` from 128 to 256 for both cross entropy and NCE loss, keeping all other parameters the same.
- Window Size: Changed `skip_window` from 4 to 8 and `num_skips` from 8 to 16 for both cross entropy and NCE loss, keeping all other parameters the same.
- Learning Rate: Changed `learning_rate` from 1 to 1.5 for both cross entropy and NCE loss, keeping all other parameters the same.
- Num Steps: Changed `max_num_steps` from 200001 to 1000001 for cross entropy loss, keeping all other parameters the same.
- Num Negative Samples: Changed `num_sampled` from 64 to 128 for NCE loss, keeping all other parameters the same.
- Num Steps: Changed