supervised learning with non-linear models

Motivation
Previously, our learning method was linear in the parameters \theta (i.e. we can have non-linear x, but our \theta is always linear). Today: with deep learning we can have non-linearity in both \theta and x.

constituents
- The dataset: \left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}
- Our per-example loss: J^{(i)}\left(\theta\right) = \left(y^{(i)} - h_{\theta}\left(x^{(i)}\right)\right)^{2}
- Our overall cost: J\left(\theta\right) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\left(\theta\right)
- Optimization objective: \min_{\theta} J\left(\theta\right)
- Optimization step: \theta = \theta - \alpha \nabla_{\theta} J\left(\theta\right)
- Hyperparameters: learning rate \alpha, batch size B, number of iterations n_{\text{iter}}
- Update variants: stochastic gradient descent (where we randomly sample one dataset point per step) or mini-batch gradient descent (where we average the gradient over a batch of B points and scale the learning rate accordingly).

neural network requirements
additional information

Background
Notation: x is the input, h is the hidden layers, and \hat{y} is the prediction. We call each weight, at each layer, from x_{i} to h_{j}, \theta_{i,j}^{(h)}. At every neuron on each layer, we calculate:
note! we often
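The optimization step above can be sketched in code. This is a minimal stochastic gradient descent example, assuming a one-parameter linear hypothesis h_\theta(x) = \theta x and a noiseless toy dataset (both are illustrative choices, not from the notes); the per-example squared loss and the update \theta = \theta - \alpha \nabla_\theta J^{(i)}(\theta) match the definitions above.

```python
import random

# Toy dataset generated from y = 2x, so the optimal parameter is theta = 2.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5]]

def loss(theta, x, y):
    # Per-example squared loss: J^(i)(theta) = (y - h_theta(x))^2
    return (y - theta * x) ** 2

def grad(theta, x, y):
    # d/dtheta of (y - theta*x)^2 = -2 * x * (y - theta*x)
    return -2.0 * x * (y - theta * x)

def sgd(data, alpha=0.05, n_iter=500, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(n_iter):
        x, y = rng.choice(data)             # randomly sample one point (SGD)
        theta -= alpha * grad(theta, x, y)  # theta = theta - alpha * grad J^(i)
    return theta

theta = sgd(data)
```

Because the toy data is noiseless, theta converges close to 2; mini-batch gradient descent would differ only in averaging `grad` over B sampled points before each update.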