supervised learning with non-linear models

Motivation
Previously, our learning method was linear in the parameters \theta (i.e. we can have non-linear x, but our \theta is always linear). Today: with deep learning we can have non-linearity in both \theta and x.

constituents
We have the dataset \left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}.
Our per-example loss: J^{(i)}\left(\theta\right) = \left(y^{(i)} - h_{\theta}\left(x^{(i)}\right)\right)^{2}
Our overall cost: J\left(\theta\right) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\left(\theta\right)
Optimization: \min_{\theta} J\left(\theta\right)
Optimization step: \theta := \theta - \alpha \nabla_{\theta} J\left(\theta\right)

Hyperparameters:
- learning rate \alpha
- batch size B
- number of iterations n_{\text{iter}}

We can use stochastic gradient descent (where we randomly sample a dataset point at each step, etc.) or batch gradient descent (where we scale the learning rate by the batch size and compute the gradient over a batch).

neural network requirements

additional information

Background
Notation: x is the input, h is the hidden layer, and \hat{y} is the prediction. We call each weight, at each layer, from x_{i} to h_{j}, \theta_{i,j}^{(h)}. At every neuron on each layer, we calculate:
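The optimization step above can be sketched in code. This is a minimal stochastic-gradient-descent sketch, assuming (for illustration only) a linear hypothesis h_\theta(x) = \theta^\top x so the per-example gradient has a closed form; the function names `grad_i` and `sgd` are hypothetical, not from the source.

```python
import numpy as np

def grad_i(theta, x_i, y_i):
    # gradient of the per-example loss J^{(i)}(theta) = (y_i - theta.x_i)^2
    # w.r.t. theta: -2 (y_i - theta.x_i) x_i
    return -2.0 * (y_i - x_i @ theta) * x_i

def sgd(theta, X, Y, alpha=0.05, n_iter=500, seed=0):
    # stochastic gradient descent: sample one dataset point per step,
    # then apply the update theta := theta - alpha * grad J^{(i)}(theta)
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        i = rng.integers(len(X))
        theta = theta - alpha * grad_i(theta, X[i], Y[i])
    return theta
```

For example, on the consistent dataset y = 2x the iterates contract toward \theta = 2 as long as \alpha is small enough that each update is a contraction.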

\begin{equation} h_{j} = \sigma\left[\sum_{i} x_{i} \theta_{i,j}^{(h)}\right] \end{equation}
\begin{equation} \hat{y} = \sigma\left[\sum_{i} h_{i}\theta_{i}^{(y)}\right] \end{equation}
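The two equations above amount to a forward pass through one hidden layer. A minimal sketch, assuming \sigma is the logistic sigmoid and packing the weights \theta_{i,j}^{(h)} and \theta_{i}^{(y)} into arrays (`theta_h`, `theta_y` are hypothetical names):

```python
import numpy as np

def sigmoid(z):
    # logistic sigmoid, one common choice for the activation sigma
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, theta_h, theta_y):
    # hidden layer: h_j = sigma( sum_i x_i * theta^{(h)}_{i,j} )
    h = sigmoid(x @ theta_h)
    # output: y_hat = sigma( sum_i h_i * theta^{(y)}_i )
    y_hat = sigmoid(h @ theta_y)
    return y_hat
```

With all weights zero, every pre-activation sum is 0, so each h_j = \sigma(0) = 0.5 and \hat{y} = 0.5 as well.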

note! we often
