Neural Network Unit

A neural unit takes a real-valued vector as input, multiplies each component by a weight, sums the results, and adds a bias:

\begin{equation} z = w \cdot x + b \end{equation}

We then squash z with a non-linear transform, using the result as the unit's "activation".
One common activation is the sigmoid. So, one common formulation would be:

\begin{equation} y = \sigma(z) = \frac{1}{1 + e^{-z}} \end{equation}
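A minimal sketch of a single unit in numpy; the weights, bias, and input values here are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Squash a real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and input, for illustration only.
w = np.array([0.2, 0.3, 0.9])
b = 0.5
x = np.array([0.5, 0.6, 0.1])

z = w @ x + b      # weighted sum of the inputs, plus bias
a = sigmoid(z)     # squashed "activation"
print(z, a)        # z = 0.87, a ~ 0.70
```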
Tanh

\begin{equation} y(z) = \frac{e^{z} - e^{-z}}{e^{z}+e^{-z}} \end{equation}

This causes "saturation": the derivative approaches 0 at large values of |z|.

relu

\begin{equation} y(z) = \max(z,0) \end{equation}

multi-layer networks

Single computing units can't compute XOR. Consider a perceptron:

\begin{equation} y = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \end{equation}
meaning we obtain a line that acts as a decision boundary: we output 0 if the input is on one side of the line, and 1 if it is on the other. XOR, unfortunately, does not have a single linear boundary; it's not linearly separable. Logistic regression, for instance, can't compute XOR because it is linear up until the squashing step.
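A two-layer network, by contrast, can compute XOR. A minimal sketch using the standard hand-picked ReLU weights (illustrative, not learned):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

# Hand-picked weights for a 2-layer ReLU network that computes XOR.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = relu(np.array(x) @ W1 + b1)  # hidden layer re-maps inputs non-linearly
    y = h @ w2                       # in the hidden space, XOR is linearly separable
    print(x, "->", int(y))           # prints XOR: 0, 1, 1, 0
```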
feed-forward network

We can think of logistic regression as a one-layer network, generalizing over the sigmoid:

\begin{equation} \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} \end{equation}

A multinomial logistic regression uses the above, and is considered a single "layer" in the feed-forward network.

notation:
- W^{(j)}, the weight matrix for layer j
- b^{(j)}, the bias vector for layer j
- g^{(j)}, the activation function at layer j
- z^{(i)}, the output at layer i (before the activation function)
- a^{(i)}, the activation at layer i

Instead of a bias, we sometimes add a dummy node a_{0}: we force a value of 1 at a_{0} and use its weights as the bias.

embeddings

We use the vector-space model to feed words into networks: each word is first converted into an embedding, which is then fed into the network. To fix variable-length problems:
- take the mean of all the word embeddings as a sentence embedding
- take the element-wise max of all the word embeddings to create a sentence embedding
- use the max length + padding

For Language Models, we can use a "sliding window"; that is, approximate the full history by the previous N-1 words:

\begin{equation} P(w_t \mid w_{1}^{t-1}) \approx P(w_t \mid w_{t-N+1}^{t-1}) \end{equation}
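A sketch of the sliding-window setup, assuming a toy vocabulary and random embeddings (purely illustrative): the previous N-1 words are embedded and concatenated into one fixed-length network input.

```python
import numpy as np

# Toy vocabulary and random embeddings, purely illustrative.
vocab = {"<s>": 0, "the": 1, "cat": 2, "sat": 3}
E = np.random.randn(len(vocab), 4)      # one 4-dimensional embedding per word

def window_input(context, N=3):
    """Embed the previous N-1 words and concatenate them
    into a single fixed-length input vector."""
    ids = [vocab[w] for w in context[-(N - 1):]]
    return np.concatenate([E[i] for i in ids])

sentence = ["<s>", "the", "cat", "sat"]
for t in range(2, len(sentence)):       # slide the window across the sentence
    x = window_input(sentence[:t])      # fixed-length input, shape (8,)
    target = sentence[t]                # the word the LM should predict
    print(sentence[t - 2:t], "->", target)
```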
Training

For every tuple (x,y), we run a forward pass to obtain \hat{y}. Then, we run the network backwards to update the weights. The loss function is the negative log probability of the correct labels:

\begin{equation} L = -\log P(y \mid x) \end{equation}

backpropagation (backprop)
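A minimal sketch of this forward/backward training loop for a single sigmoid unit, where the backward pass is the analytic gradient of the loss above (the OR data and learning rate are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (illustrative): learn OR, which is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

w = np.zeros(2)
b = 0.0
lr = 1.0

for step in range(200):
    # forward pass
    y_hat = sigmoid(X @ w + b)
    # loss: negative log probability of the correct labels
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # backward pass: gradient of the loss w.r.t. w and b
    grad_z = (y_hat - y) / len(y)
    w -= lr * (X.T @ grad_z)
    b -= lr * grad_z.sum()

print(loss, np.round(sigmoid(X @ w + b)))  # loss shrinks; predictions match OR
```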