Word Representation

  • “The dog CHASES the cat” and “The dog FOLLOWS the cat” should look similar to a computer.
  • The same holds for a Google search: a query like “I want to buy a dog” should also match results about puppies.

How do we represent the meaning of a word?

  • One option is the dictionary definition of a word: useful for humans, not for computers.
  • Common solution: WordNet, a thesaurus-like resource that lists synonym sets and hypernyms (“is a” relationships) for each word.

Problems with WordNet:

  • Listed synonyms are rarely complete synonyms: proficient and good are interchangeable only in some contexts.
  • Missing new meanings of words: genius, badass, etc.
  • Impossible to keep up-to-date; it requires manual curation.
  • No way to express that chair and table are somewhat similar.

Representing as discrete symbols:

  • [0 0 0 0 0 0 0 1 0 0 0]
  • One-hot vectors: just one 1, the rest 0s.
  • Vector dimension = number of words in the vocabulary (e.g. 500,000).
  • No notion of similarity: motel and hotel are related words, yet their one-hot vectors are orthogonal (see the sketch below).
  • Can try to use WordNet, but all the drawbacks still remain.
  • Instead: learn to encode similarity in the vectors themselves.
  • E.g. one dimension could indicate whether the word is a noun (0/1), and so on.
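
A minimal sketch of one-hot representations and why they carry no similarity (the toy vocabulary is illustrative):

```python
import numpy as np

# Toy vocabulary (illustrative); a real vocabulary has hundreds of thousands of words.
vocab = ["the", "dog", "cat", "motel", "hotel"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector with a single 1 at the word's index, 0 everywhere else."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# "motel" and "hotel" are related words, but their one-hot vectors are orthogonal:
# the dot product is 0, so the representation encodes no similarity at all.
print(one_hot("motel") @ one_hot("hotel"))  # 0.0
```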

Representing words by their context:

  • "You should know a word by the company it keeps".
  • Successful in NLP.
  • Context is though as a fixed-size window around the word.
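
A small sketch of how fixed-size context windows can be extracted (window size and sentence are illustrative):

```python
def context_windows(tokens, window=2):
    """Yield (center word, list of context words) pairs for a fixed-size window."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

sentence = "the dog chases the cat".split()
for center, context in context_windows(sentence):
    print(center, "->", context)
# e.g. chases -> ['the', 'dog', 'the', 'cat']
```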

Word vectors:

  • A dense vector for each word, built so that words appearing in similar contexts get similar vectors.
  • Also called word embeddings or word representations.

Word meaning is thus captured by a dense neural word vector; such vectors can be visualized.
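
A tiny illustration of what “similar vectors” means, using made-up dense vectors (the values are illustrative, not real embeddings):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: close to 1 for similar directions, near 0 for unrelated ones."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 3-dimensional vectors; real embeddings typically have 100-300 dimensions.
hotel = np.array([0.8, 0.1, 0.6])
motel = np.array([0.7, 0.2, 0.5])
cat   = np.array([-0.4, 0.9, 0.1])

print(cosine(hotel, motel))  # high: words with similar contexts get similar vectors
print(cosine(hotel, cat))    # low: unrelated words
```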

Word2vec:

  • P(o | c) (or P(c | o)): the probability of an outside (context) word given the center word, or vice versa.
  • The likelihood of the observed context words, as a function of the model parameters, is what gets optimized.
  • The probabilities are computed with a softmax over the vocabulary (formulas below).
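
For reference, the standard word2vec skip-gram formulation: the likelihood over a corpus of length T with window size m, and the softmax that defines each probability (v are center-word vectors, u are context-word vectors, V is the vocabulary):

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta),
\qquad
P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$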

Train the model:

  • Theta (θ) represents all model parameters in one long vector: all the center-word and context-word vectors.
  • Training means calculating the gradients of the objective with respect to all of these parameters.
  • Each word gets two vectors: one used when it is the center word and one used when it is a context word.
  • Two model variants: skip-gram (SG), which predicts the context words from the center word (independent of position), and continuous bag of words (CBOW), which predicts the center word from its context.
  • Negative sampling can be added to make training more efficient (a training-step sketch follows this list).
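
A minimal numpy sketch of one skip-gram update with the naive softmax (no negative sampling here; names, dimensions, and the learning rate are illustrative assumptions, not the course code):

```python
import numpy as np

def softmax(x):
    x = x - x.max()                      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def skipgram_step(center, outside, V, U, lr=0.05):
    """One naive-softmax skip-gram update for a (center, outside) word pair.

    V: center-word vectors, shape (vocab_size, dim)
    U: outside/context-word vectors, shape (vocab_size, dim)
    Each word has two vectors, as noted above.
    """
    v_c = V[center]
    probs = softmax(U @ v_c)             # P(o | c) for every word o in the vocabulary
    loss = -np.log(probs[outside])

    # Gradient of the cross-entropy loss: predicted probabilities minus the one-hot target.
    dscores = probs.copy()
    dscores[outside] -= 1.0
    grad_v = U.T @ dscores               # dJ / dv_c
    grad_U = np.outer(dscores, v_c)      # dJ / dU

    V[center] -= lr * grad_v
    U -= lr * grad_U
    return loss

# Illustrative usage: vocabulary of 5 words, embedding dimension 8.
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(5, 8))
U = rng.normal(scale=0.1, size=(5, 8))
print(skipgram_step(1, 3, V, U))
```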

Optimization:

  • We get an objective function J(θ), and minimize it.
  • A learning rate controls the size of each update step (see the update rule below).
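
The corresponding gradient descent update, where α is the learning rate (step size):

$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J(\theta)$$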

Stochastic Gradient Descent:

  • Computing the gradient over the entire corpus is slow.
  • Instead, take a small sample of the data (e.g. one window or a minibatch), compute the gradient on it, update, and repeat (see the sketch below).
  • A side benefit: the noise in stochastic gradient descent helps escape local minima.
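
A minimal sketch of the SGD loop; `gradient(theta, batch)` is a hypothetical helper that returns the gradient of J on a small batch:

```python
import numpy as np

def sgd(theta, data, gradient, lr=0.05, epochs=10, batch_size=32):
    """Stochastic gradient descent: cheap, noisy updates on small random subsets.

    `gradient(theta, batch)` is an assumed helper returning dJ/dtheta for that batch.
    """
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(len(data))                # reshuffle every epoch
        for start in range(0, len(data), batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            theta = theta - lr * gradient(theta, batch)   # one cheap, noisy step
    return theta
```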