We will train a classifier on a binary prediction task: "are the context words c_{1:L} likely to show up near some target word w_{0}?" We estimate the probability that w_{0} occurs within this window as the product, over the context words, of probabilities derived from the similarity between each context word's embedding and the target word's embedding.

The setup:

- we have a corpus of text
- each word is represented by a vector
- go through each position t in the text, which has a center word c and a set of context words o \in O
- use the similarity of the word vectors for c and o to calculate P(o|c)

Meaning, we want to devise a model that predicts high probabilities P(w_{t-n}|w_{t}) for small n and low probabilities for large n.

Word2Vec is a Bag of Words model!

This is a Bag of Words model because our training does not learn any information about the ordering and structure between words.

Likelihood

If we wrote the above out, for a corpus of T positions and a window of size m:

\begin{equation} L(\theta) = \prod_{t=1}^{T} \prod_{-m \leq j \leq m, j\neq 0}^{} p_{\theta}\left(w_{t+j} | w_{t}\right) \end{equation}
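Calculating p_{\theta}

We are going to use TWO VECTORS for each word w:

- v_{w} when w is the center word
- u_{w} when w is a context word

These vectors are the only parameters of our system. We keep two vectors per word only to make the math easy; to get the final "word vector" for a word, we average the two. Therefore:

\begin{equation} p_{\theta}\left(o | c\right) = \frac{\exp\left(u_{o}^{\top} v_{c}\right)}{\sum_{w \in V}^{} \exp\left(u_{w}^{\top} v_{c}\right)} \end{equation}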
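A minimal sketch of this calculation, assuming the usual naive-softmax form P(o|c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c). The toy matrices `U` and `V` and the helper names are hypothetical, chosen only for illustration; subtracting the maximum score is a standard stability trick that does not change the result.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def p_context_given_center(o, c, U, V):
    """Naive softmax P(o | c) over the whole vocabulary.

    U[w] is the context vector u_w, V[w] is the center vector v_w.
    """
    scores = [dot(u_w, V[c]) for u_w in U]
    # Shift by the max score before exponentiating (numerical stability);
    # the shift cancels in the ratio.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    return exps[o] / sum(exps)

# Hypothetical toy parameters: 3 words, 2-dimensional vectors.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

# Distribution over context words when word 0 is the center word.
probs = [p_context_given_center(o, 0, U, V) for o in range(len(U))]
```

Word 2 has the largest dot product with the center vector of word 0, so it receives the largest probability, and the probabilities sum to one.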
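- exponentiation makes anything positive
- we normalize over the entire vocabulary
- this is a softmax operation

Objective Function

But in practice we perform:

- gradient descent, so we negate the likelihood to get something to minimize
- a log on each value, to prevent underflow
- an average over the T positions (why not)

\begin{equation} J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j\neq 0}^{} \log p_{\theta}\left(w_{t+j} | w_{t}\right) \end{equation}

Recall that:

\begin{equation} p_{\theta}\left(o | c\right) = \frac{\exp\left(u_{o}^{\top} v_{c}\right)}{\sum_{w \in V}^{} \exp\left(u_{w}^{\top} v_{c}\right)} \end{equation}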
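As a rough sketch, J(\theta) can be computed on a toy corpus as the average negative log-likelihood of every (center, context) pair. The corpus of word ids, the zero-initialized `U` and `V`, and the choice to skip window positions that fall off the ends of the corpus are all assumptions made for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_p(o, c, U, V):
    """log P(o | c) under the naive softmax, computed via log-sum-exp."""
    scores = [dot(u_w, V[c]) for u_w in U]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[o] - log_z

def objective(corpus, U, V, m=1):
    """J(theta): negative log-likelihood averaged over the T positions."""
    T = len(corpus)
    total = 0.0
    for t, center in enumerate(corpus):
        for j in range(-m, m + 1):
            # Skip the center itself and positions outside the corpus
            # (an assumption; the formula does not say how to treat edges).
            if j == 0 or not (0 <= t + j < T):
                continue
            total += log_p(corpus[t + j], center, U, V)
    return -total / T

# Hypothetical toy setup: 3-word vocabulary, all-zero vectors.
corpus = [0, 1, 2, 1]
U = [[0.0, 0.0] for _ in range(3)]
V = [[0.0, 0.0] for _ in range(3)]
J = objective(corpus, U, V, m=1)
```

With all-zero vectors every context word has probability 1/3, and this corpus has 6 in-range pairs over T = 4 positions, so J works out to 1.5 · log 3.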
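Because we need to minimize this, we need its derivative with respect to the parameters; here, with respect to the center vector v_{c}:

\begin{equation} \frac{\partial J(\theta)}{\partial v_{c}} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j\neq 0}^{} \frac{\partial}{\partial v_{c}} \log p_{\theta}\left(w_{t+j} | w_{t}\right) \end{equation}

meaning, we now can just calculate the inner part, for one center word c and one context word o:

\begin{equation} \frac{\partial}{\partial v_{c}} \log p_{\theta}\left(o | c\right) = \frac{\partial}{\partial v_{c}} \log \exp\left(u_{o}^{\top} v_{c}\right) - \frac{\partial}{\partial v_{c}} \log \sum_{w \in V}^{} \exp\left(u_{w}^{\top} v_{c}\right) \end{equation}

Look! The first part is a log of an exp, which cancels out, so its derivative is just u_{o}. For the right part, by the chain rule:

\begin{equation} \frac{\partial}{\partial v_{c}} \log \sum_{w \in V}^{} \exp\left(u_{w}^{\top} v_{c}\right) = \frac{1}{\sum_{w \in V}^{} \exp\left(u_{w}^{\top} v_{c}\right)} \sum_{x \in V}^{} \exp\left(u_{x}^{\top} v_{c}\right) u_{x} \end{equation}

Combining this whole thing, we have:

\begin{equation} \frac{\partial}{\partial v_{c}} \log p_{\theta}\left(o | c\right) = u_{o} - \frac{\sum_{x \in V}^{} \exp\left(u_{x}^{\top} v_{c}\right) u_{x}}{\sum_{w \in V}^{} \exp\left(u_{w}^{\top} v_{c}\right)} \end{equation}

Rewriting this slightly:

\begin{equation} \frac{\partial}{\partial v_{c}} \log p_{\theta}\left(o | c\right) = u_{o} - \sum_{x \in V}^{} \frac{\exp\left(u_{x}^{\top} v_{c}\right)}{\sum_{w \in V}^{} \exp\left(u_{w}^{\top} v_{c}\right)} u_{x} \end{equation}

Meaning:

\begin{equation} \frac{\partial}{\partial v_{c}} \log p_{\theta}\left(o | c\right) = u_{o} - \sum_{x \in V}^{} p_{\theta}\left(x | c\right) u_{x} \end{equation}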
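One way to sanity-check this result is to compare the closed form u_o − Σ_x p(x|c) u_x against a finite-difference estimate of the gradient of log p(o|c) with respect to v_c. This is only a sketch: the toy vectors and function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_probs(v_c, U):
    """p(x | c) for every word x, given the center vector v_c."""
    scores = [dot(u, v_c) for u in U]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_p(o, v_c, U):
    """Closed form: u_o minus the expected context vector under p(. | c)."""
    p = softmax_probs(v_c, U)
    dims = range(len(v_c))
    expected = [sum(p[x] * U[x][d] for x in range(len(U))) for d in dims]
    return [U[o][d] - expected[d] for d in dims]

def numeric_grad_log_p(o, v_c, U, eps=1e-6):
    """Central finite differences on log p(o | c) with respect to v_c."""
    grad = []
    for d in range(len(v_c)):
        plus, minus = list(v_c), list(v_c)
        plus[d] += eps
        minus[d] -= eps
        f_plus = math.log(softmax_probs(plus, U)[o])
        f_minus = math.log(softmax_probs(minus, U)[o])
        grad.append((f_plus - f_minus) / (2 * eps))
    return grad

# Hypothetical toy parameters.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_c = [0.3, -0.2]
analytic = grad_log_p(2, v_c, U)
numeric = numeric_grad_log_p(2, v_c, U)
```

The two gradients should agree up to finite-difference error, which supports reading the gradient as "observed context vector minus expected context vector".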
The right-hand term, \sum_{x} p_{\theta}(x | c) u_{x}, is just the softmax probability of each u_{x} times u_{x}, meaning it is \mathbb{E}[u_{x}]; so the gradient is the observed context vector minus the expected one, and this loss just minimizes "error between output and expectation".

Word2Vec Variants

Model

- skip-gram: predict the probability of the side (context) words given the center word, P(o|c)
- CBOW: predict the probability of the center word given the side words

Objective

- naive softmax (above)
- hierarchical softmax
- negative sampling (see also skip-gram with negative sampling)

properties

window size

- smaller windows: capture more syntax-level information
- larger windows: capture more semantic-field information

parallelogram model

A simple way to solve analogy problems with vector semantics: get the difference between two word vectors, and add it somewhere else to get an analogous transformation. This only works for frequent words and small distances, and not quite for large systems.

allocational harm

Embeddings bake in existing biases, which leads to bias in hiring practices, etc.

skip-gram with negative sampling

skip-gram trains vectors separately for a word being used as a target and a word being used as context.

The mechanism for training the embedding:

- select some k, which is the count of negative examples (if k=2, every positive example will be matched with 2 negative examples)
- sample a target word, and generate positive samples by pairing it with the words in its immediate window
- sample (window size) times k negative examples, where the noise words are chosen explicitly as not being near our target word, and weighted based on unigram frequency
- for each paired training sample, we minimize the binary cross entropy loss, where v_{w} is the target vector, u_{pos} the context vector of the positive example, and u_{neg_{i}} the context vector of the i-th negative example:

\begin{equation} L_{CE} = -\left[ \log \sigma\left(u_{pos}^{\top} v_{w}\right) + \sum_{i=1}^{k} \log \left(1 - \sigma\left(u_{neg_{i}}^{\top} v_{w}\right)\right) \right] \end{equation}
recall that:

\begin{equation} \sigma(x) = \frac{1}{1 + e^{-x}} \end{equation}
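Importantly, the sigmoid function is symmetric: \sigma(-x) = 1 - \sigma(x). So really our objective is (v_{w}: target vector; u_{pos}, u_{neg_{i}}: context vectors of the positive and the i-th negative example):

\begin{equation} L_{CE} = -\left[ \log \sigma\left(u_{pos}^{\top} v_{w}\right) + \sum_{i=1}^{k} \log \sigma\left(-u_{neg_{i}}^{\top} v_{w}\right) \right] \end{equation}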
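A small sketch of this loss for a single positive pair and k negatives, using the identity \sigma(-x) = 1 - \sigma(x). The function names and toy vectors are hypothetical, and the two-branch sigmoid is just one common way to avoid overflow for large |x|.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sgns_loss(v_w, u_pos, u_negs):
    """Binary cross entropy for one positive pair and k negative pairs.

    v_w: target vector; u_pos: context vector of the positive example;
    u_negs: list of context vectors of the sampled noise words.
    """
    loss = -math.log(sigmoid(dot(u_pos, v_w)))
    for u_neg in u_negs:
        # log(1 - sigma(x)) is written as log(sigma(-x)).
        loss -= math.log(sigmoid(-dot(u_neg, v_w)))
    return loss

# Hypothetical toy vectors with k = 2 negatives.
untrained = sgns_loss([0.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
separated = sgns_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0], [-2.0, 0.0]])
```

With a zero target vector every sigmoid is 0.5, so the loss is (1 + k) · log 2; it drops once the positive context vector aligns with the target and the negatives point away from it.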
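how to sample the k negative examples

Rather than sampling from the raw unigram frequencies, we actually sample from a weighted unigram distribution:

\begin{equation} P_{\alpha}(w) = \frac{\text{count}(w)^{\alpha}}{\sum_{w'}^{} \text{count}(w')^{\alpha}} \end{equation}

with \alpha < 1 (the original word2vec uses \alpha = 0.75), to give the less common words slightly higher probability.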
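A sketch of drawing noise words this way, assuming the weighting is count(w)^\alpha with \alpha = 0.75 and that the only word excluded is the target itself (a simplification of "not near the target word"). The token list and function names are hypothetical.

```python
import random
from collections import Counter

def noise_distribution(tokens, alpha=0.75):
    """Unigram distribution with counts raised to the power alpha."""
    counts = Counter(tokens)
    weights = {w: c ** alpha for w, c in counts.items()}
    z = sum(weights.values())
    return {w: wt / z for w, wt in weights.items()}

def sample_negatives(dist, target, k, rng):
    """Draw k noise words from dist, never returning the target itself."""
    words = [w for w in dist if w != target]
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

# Hypothetical toy corpus: one very frequent word and two rare ones.
tokens = ["the"] * 8 + ["cat"] + ["sat"]
dist = noise_distribution(tokens)
negatives = sample_negatives(dist, target="the", k=2, rng=random.Random(0))
```

In this toy corpus the raw unigram probability of "cat" is 0.1 and that of "the" is 0.8; after weighting, "cat" moves up and "the" moves down, which is the intended effect on rare words.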