requirements

h\left(x\right), the predictor function; x, y, the samples of data

definition

\begin{equation} J\left(\theta\right) = \frac{1}{2} \sum_{i=1}^{n}\left(h_{\theta }\left(x^{(i)}\right) - y^{(i)}\right)^{2} \end{equation}

see also

example: gradient descent for least-squares error.

additional information

“Why the 1/2?” Because when you take \nabla J\left(\theta\right), the \frac{1}{2} and the 2 brought down by the chain rule cancel out.

probabilistic intuition for least-squares error in linear regression

Assume that our dataset \left(x^{(i)}, y^{(i)}\right) \sim D has the following property: “the true y value is just our model’s output, plus some error.” Meaning:

\begin{equation} y^{(i)} = \theta^{\top} x^{(i)} + \varepsilon^{(i)} \end{equation}
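As a quick check of the definition of J\left(\theta\right) and the “why the 1/2” remark above, here is a minimal sketch; scalar \theta, the toy dataset, and the learning rate are my own illustrative assumptions, not from the text:

```python
def J(theta, xs, ys):
    """Least-squares cost from the definition above (scalar theta for simplicity)."""
    return 0.5 * sum((theta * x - y) ** 2 for x, y in zip(xs, ys))

def grad_J(theta, xs, ys):
    # Derivative of J with respect to theta: the 1/2 cancels the 2
    # that the chain rule brings down from the square.
    return sum((theta * x - y) * x for x, y in zip(xs, ys))

# one gradient-descent step with a hypothetical learning rate of 0.1
xs, ys = [1.0, 2.0], [2.0, 4.0]
theta = 0.0
theta = theta - 0.1 * grad_J(theta, xs, ys)
```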

Assume too now that \varepsilon^{(i)} \sim \mathcal{N}\left(0, \sigma^{2}\right) for all i, that the error is normally distributed. Recall the PDF of the normal distribution:

\begin{equation} P\left(\varepsilon^{(i)}\right) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left( \frac{- \left(\varepsilon^{(i)}\right)^{2}}{2\sigma^{2}}\right) \end{equation}
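The PDF above can be evaluated directly; a minimal sketch (the helper name is my own):

```python
import math

def normal_pdf(eps, sigma=1.0):
    """PDF of N(0, sigma^2) evaluated at eps."""
    return math.exp(-eps ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
```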

Plugging in our definition for \varepsilon here:

\begin{equation} P\left(y^{(i)} | x^{(i)}, \theta\right) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left( \frac{- \left(y^{(i)}- \theta^{\top}x^{(i)}\right)^{2}}{2\sigma^{2}}\right) \end{equation}

If we now assume the entire dataset is IID, we can then write:

\begin{align} P\left(y | x, \theta\right) &= \prod_{i=1}^{n} P\left(y^{(i)} | x^{(i)}, \theta\right) \\ &= \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp \left( \frac{- \left(y^{(i)}- \theta^{\top}x^{(i)}\right)^{2}}{2\sigma^{2}}\right) \end{align}
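The IID product above can be computed literally; a sketch assuming scalar \theta and my own helper names:

```python
import math

def normal_pdf(eps, sigma=1.0):
    # PDF of N(0, sigma^2) evaluated at eps
    return math.exp(-eps ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(theta, xs, ys, sigma=1.0):
    """Joint likelihood P(y | x, theta) under the IID assumption:
    the product of per-sample Gaussian densities of the residuals."""
    p = 1.0
    for x, y in zip(xs, ys):
        p *= normal_pdf(y - theta * x, sigma)  # residual eps^(i) = y^(i) - theta * x^(i)
    return p
```

(In practice this product underflows to zero for large n, which is one more practical reason to work with the log instead.)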

To pick \theta, we perform MLE: we want the model that maximizes the likelihood of seeing our real data y. Meaning, we desire:

\begin{equation} \theta = \arg\max_{\theta} P\left(y | x,\theta\right) \end{equation}

Let’s do it! First, let’s write the quantity we want to maximize as a function of \theta:

\begin{equation} L\left(\theta\right) = P\left(y | x, \theta\right) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp \left( \frac{- \left(y^{(i)}- \theta^{\top}x^{(i)}\right)^{2}}{2\sigma^{2}}\right) \end{equation}

Recall that \log is monotonic, so:

\begin{align} \arg\max_{\theta} L\left(\theta\right) &= \arg\max_{\theta} \log \left(L\left(\theta\right)\right) \\ &= \arg\max_{\theta} \log \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}} \exp \left(\dots\right) \\ &= \arg\max_{\theta} n \log \frac{1}{\sigma\sqrt{2\pi}} + \sum_{i=1}^{n} \frac{-\left(y^{(i)}- \theta^{\top}x^{(i)}\right)^{2}}{2\sigma^{2}} \end{align}
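As a numeric sanity check of that last step, the simplified form can be compared against the direct sum of log-densities; the toy data and scalar \theta are my own assumptions:

```python
import math

def log_likelihood(theta, xs, ys, sigma=1.0):
    """Simplified form: n * log(1/(sigma*sqrt(2*pi))) minus the scaled
    sum of squared residuals, as in the last line of the derivation."""
    n = len(xs)
    const = n * math.log(1.0 / (sigma * math.sqrt(2 * math.pi)))
    return const - sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)

def log_likelihood_direct(theta, xs, ys, sigma=1.0):
    # log of the product = sum of logs of the per-sample Gaussian densities
    return sum(
        math.log(math.exp(-(y - theta * x) ** 2 / (2 * sigma ** 2))
                 / (sigma * math.sqrt(2 * math.pi)))
        for x, y in zip(xs, ys))
```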

We can throw away the left term, since it’s just a constant in \theta, and maximizing the remaining term is the same as minimizing the least-squares error J\left(\theta\right): the factor \frac{1}{2\sigma^{2}} only scales the objective, so the \arg\max doesn’t depend on \sigma (equivalently, take \sigma = 1). Yay!
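To see the equivalence concretely, here is a grid-search sketch over a hypothetical toy dataset (with \sigma = 1, the log-likelihood is a constant minus J, so the maximizer of one is the minimizer of the other):

```python
import math

xs = [1.0, 2.0, 3.0]   # hypothetical toy inputs
ys = [2.1, 3.9, 6.2]   # hypothetical toy targets

def J(theta):
    # least-squares cost (scalar theta for simplicity)
    return 0.5 * sum((theta * x - y) ** 2 for x, y in zip(xs, ys))

def log_lik(theta, sigma=1.0):
    n = len(xs)
    return n * math.log(1.0 / (sigma * math.sqrt(2 * math.pi))) - sum(
        (y - theta * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)

# coarse grid of candidate thetas in [1, 3]
candidates = [i / 100 for i in range(100, 301)]
best_mle = max(candidates, key=log_lik)
best_lsq = min(candidates, key=J)
assert best_mle == best_lsq  # same theta either way
```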
