It’s hard to find a globally optimal solution, so instead we make local progress.

constituents
- parameters \theta
- step size \alpha
- cost function J (and its derivative J')

requirements
Let \theta^{(0)} = 0 (or a random point), and then:
“update the weights by taking a step in the direction opposite the gradient, scaled by the step size”. We stop when it’s “good enough”: the training data is noisy enough that a slightly non-convergent optimization is fine.

additional information

multi-dimensional case
\begin{equation} \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla J\left(\theta^{(t)}\right) \end{equation}
where \theta^{(t)} is the parameter vector at step t, \alpha is the step size, and \nabla J is the gradient of the cost.
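The multi-dimensional update can be sketched in a few lines; the quadratic cost below is a stand-in example of my own, not something from these notes:

```python
import numpy as np

def grad_J(theta):
    # gradient of the stand-in cost J(theta) = ||theta - 3||^2 / 2,
    # which is minimized at theta = [3, 3]
    return theta - 3.0

def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    theta = theta0
    for _ in range(steps):
        # step opposite the gradient, scaled by the step size alpha
        theta = theta - alpha * grad(theta)
    return theta

# theta^{(0)} = 0, per the requirements above
theta = gradient_descent(grad_J, np.zeros(2))
```

Each step shrinks the distance to the minimizer by a constant factor here, so 100 steps is plenty for this toy cost.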
gradient descent for least-squares error

We have:
\begin{equation} J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left(h_{\theta}\left(x^{(i)}\right) - y^{(i)}\right)^{2} \end{equation}
we want to take the derivative of this, which is actually pretty painless.
recall that h_{\theta}(x) = \theta_{0} x_{0} + \ldots and so:
\begin{equation} \frac{\partial}{\partial \theta_{j}} h_{\theta}(x) = x_{j} \end{equation}
since the derivative of every other term is 0. So, by the chain rule, our update rule for each coordinate j is:
\begin{equation} \theta_{j}^{(t+1)} = \theta_{j}^{(t)} - \alpha \sum_{i=1}^{n} \left(h_{\theta}\left(x^{(i)}\right) - y^{(i)}\right) x_{j}^{(i)} \end{equation}
Meaning, in vector notation:
\begin{equation} \theta^{(t+1)} = \theta^{(t)} - \alpha \sum_{i=1}^{n} \left(h_{\theta}\left(x^{(i)}\right) - y^{(i)}\right) x^{(i)} \end{equation}

when does gradient descent provably work?
… on convex functions.

stochastic gradient descent
see stochastic gradient descent
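The least-squares update can be implemented directly; the synthetic data, step size, and iteration count below are illustrative choices of mine, not from the notes:

```python
import numpy as np

# synthetic data for illustration: y = 2*x_1 + 1, with x_0 = 1 as the
# bias feature, so the true parameters are theta* = [1, 2]
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = np.zeros(2)   # theta^{(0)} = 0
alpha = 0.05          # step size (small enough for this X to converge)

for _ in range(2000):
    residuals = X @ theta - y                  # h_theta(x^(i)) - y^(i) for all i
    theta = theta - alpha * (X.T @ residuals)  # the summed vector update
```

`X.T @ residuals` computes the sum over i of the residual times x^{(i)} in one matrix product; least-squares error is convex, so with a small enough step size this provably converges.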