Supervised learning (also known as behavioral cloning, when the agent is learning what to do in an observe-act cycle) is a type of decision making method.

constituents
- input space: \mathcal{X}
- output space: \mathcal{Y}
- hypothesis/model/prediction: h : \mathcal{X} \to \mathcal{Y}

A continuous \mathcal{Y} makes this a regression problem; a discrete \mathcal{Y} makes it a classification problem.

requirements
Our ultimate goal is to learn a good model h from the training set:
- what “good” means is hard to define
- we generally want to use the model on new data, not just the training set

That is, we want our hypothesis to behave as h_{\theta}\left(x^{(i)}\right) \approx y^{(i)}.

additional information

training set
The training set is a set of pairs:

\begin{equation} \left\{\left(x^{(1)}, y^{(1)}\right) \dots \left(x^{(n)}, y^{(n)}\right)\right\} \end{equation}

such that x^{(j)} \in \mathcal{X}, y^{(j)} \in \mathcal{Y}. We call n the training set size.

main procedure
- provide the agent with some examples
- use an automated learning algorithm to generalize from the examples

This is good for typically representative situations, but if you throw the agent into a completely unfamiliar situation, supervised learning cannot be expected to perform well.

Disadvantages
- the labeled data is finite
- performance is limited by the quality of the behavior demonstrated in the training data
- the agent can only interpolate between the finitely many states seen in training

cost function
see cost function

evaluation
see machine learning evaluation

linear classification is convex
We can separate two sets of points \left\{x_1 \dots x_{N}\right\}, \left\{y_1 \dots y_{M}\right\} \subseteq \mathbb{R}^{n} by a hyperplane. That is, we want to find a \in \mathbb{R}^{n}, b \in \mathbb{R} with

\begin{align} a^{T}x_{i} + b &> 0, \quad i = 1 \dots N \\ a^{T}y_{i} + b &< 0, \quad i = 1 \dots M \end{align}

These strict inequalities are homogeneous in a, b (scaling both by a positive constant preserves them), so we can rescale to shift the boundary around: the problem is feasible exactly when we can find a \in \mathbb{R}^{n}, b \in \mathbb{R} with

\begin{align} a^{T}x_{i} + b &\geq 1, \quad i = 1 \dots N \\ a^{T}y_{i} + b &\leq -1, \quad i = 1 \dots M \end{align}

Thus we have obtained an LP feasibility problem.

robust linear discrimination
Maybe you want your hyperplanes to be separated by some distance:

\begin{align} \mathcal{H}_{1} &= \left\{z \mid a^{T}z + b = 1\right\} \\ \mathcal{H}_{2} &= \left\{z \mid a^{T}z + b = -1\right\} \end{align}

the distance between two parallel hyperplanes a^{T}z = c_{1} and a^{T}z = c_{2} is \frac{|c_{1} - c_{2}|}{\norm{a}_{2}}, so the distance between \mathcal{H}_{1} and \mathcal{H}_{2} is \frac{2}{\norm{a}_{2}}; hence, to separate the points with maximum margin, we minimize \norm{a}_{2}:

\begin{align} \min_{a,b}\quad & \left(\frac{1}{2}\right)\norm{a}_{2}^{2} \\ \textrm{s.t.} \quad & a^{T}x_{i} + b \geq 1, i = 1 \dots N \\ & a^{T}y_{i} + b \leq -1, i = 1 \dots M \end{align}
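As a quick sanity check on this formulation, consider a 1-D toy instance (the data here are my own illustrative choice, not from the source): the optimal boundary sits midway between the closest opposing points, and a is scaled so those points meet their constraints with equality.

```python
# 1-D sanity check of the max-margin formulation: with min(xs) > max(ys),
# the optimal boundary is the midpoint of the closest pair, and a is
# scaled so the nearest points satisfy the constraints with equality.
xs = [2.0, 3.0, 5.0]   # points required to satisfy a*x + b >= 1
ys = [-1.0, 0.0]       # points required to satisfy a*y + b <= -1

lo, hi = max(ys), min(xs)        # closest opposing points: 0 and 2
a = 2.0 / (hi - lo)              # makes a*hi + b = 1 and a*lo + b = -1
b = -a * (hi + lo) / 2.0         # puts the boundary a*z + b = 0 at the midpoint

assert all(a * x + b >= 1.0 for x in xs)
assert all(a * y + b <= -1.0 for y in ys)
margin = 2.0 / abs(a)            # distance between the two hyperplanes
print(a, b, margin)              # 1.0 -1.0 2.0
```

Any larger margin would force one of the two closest points to violate its constraint, which is why only those "support" points determine the solution.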

approximate linear separation
When the two sets are not linearly separable, we can instead approximately “minimize misclassification”:

\begin{align} \min_{u,v,a,b}\quad & 1^{T}u + 1^{T}v \\ \textrm{s.t.} \quad & a^{T} x_{i} + b \geq 1 - u_{i}, \quad i = 1 \dots N \\ & a^{T}y_{i} + b \leq -1 + v_{i}, \quad i = 1 \dots M \\ & u \geq 0 \\ & v \geq 0 \end{align}

interpretation: u_{i} is the “extra slack” you are giving point x_{i} (and v_{i} for y_{i}), and we want as little total slack as possible; the objective 1^{T}u + 1^{T}v acts as a convex surrogate for the number of misclassified points.
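For fixed a, b the optimal slacks are u_{i} = \max(0, 1 - (a^{T}x_{i} + b)) and v_{i} = \max(0, 1 + (a^{T}y_{i} + b)), so eliminating u, v turns the LP into an equivalent unconstrained hinge-loss minimization in a, b. A minimal pure-Python sketch attacking that objective with subgradient descent (the toy data and the 1/t step size are illustrative choices, not from the source — a real LP solver would give the exact optimum):

```python
# Eliminating u, v from the LP gives the convex piecewise-linear objective
#   f(a, b) = sum_i max(0, 1 - (a.x_i + b)) + sum_j max(0, 1 + (a.y_j + b)),
# minimized here by plain subgradient descent with diminishing steps.

def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

X = [(2.0, 1.0), (3.0, 2.0), (2.5, 0.5)]   # want a.x + b >= 1
Y = [(-1.0, -1.0), (0.0, -2.0)]            # want a.y + b <= -1

a, b = [0.0, 0.0], 0.0
for t in range(1, 2001):
    step = 1.0 / t
    ga, gb = [0.0, 0.0], 0.0
    for x in X:                            # active slack u_i > 0: subgradient -x, -1
        if 1.0 - (dot(a, x) + b) > 0.0:
            ga = [gi - xi for gi, xi in zip(ga, x)]
            gb -= 1.0
    for y in Y:                            # active slack v_j > 0: subgradient +y, +1
        if 1.0 + (dot(a, y) + b) > 0.0:
            ga = [gi + yi for gi, yi in zip(ga, y)]
            gb += 1.0
    a = [ai - step * gi for ai, gi in zip(a, ga)]
    b -= step * gb

total_slack = sum(max(0.0, 1.0 - (dot(a, x) + b)) for x in X) \
            + sum(max(0.0, 1.0 + (dot(a, y) + b)) for y in Y)
print(total_slack)   # ~0 for this separable toy data
```

On separable data the total slack is driven to zero; on non-separable data it converges to the LP's optimal objective value instead.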
