constituents
samples x, z \in \mathcal{D}; d input dimensions, p feature dimensions
feature map: \phi: \mathbb{R}^{d} \to \mathbb{R}^{p}
kernel: k\left(x,z\right) = \langle\phi\left(x\right), \phi\left(z\right)\rangle
requirements
train Compute k\left(x^{(i)}, x^{(j)}\right) for all i,j \in 1 \ldots n. Set \beta_{j} = 0 for all j. Then iterate, for each i:
\begin{equation} \beta_{i} = \beta_{i} + \alpha \left[y^{(i)} - \sum_{j=1}^{n} \beta_{j} k\left(x^{(i)}, x^{(j)}\right)\right] \end{equation}
test Given x, compute \theta^{T} \phi\left(x\right), which expands as:
\begin{align} \theta^{T}\phi\left(x\right) &= \sum_{i=1}^{n} \beta_{i} \phi\left(x^{(i)}\right)^{T} \phi\left(x\right) \\ &= \sum_{i=1}^{n} \beta_{i} k\left(x^{(i)}, x\right) \end{align}
This takes O\left(nd\right) time, since each of the n kernel evaluations costs O\left(d\right) for the kernels below.
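As a concrete sketch of the train/test procedure above (assuming NumPy; the kernel choice and the helper names `fit_beta` / `predict` are just for illustration):

```python
import numpy as np

def k(x, z, c=1.0, t=3):
    # example kernel: (x^T z + c)^t; any valid kernel could be dropped in here
    return (x @ z + c) ** t

def fit_beta(X, y, alpha=1e-3, iters=100):
    # X: (n, d) array of inputs, y: (n,) targets
    n = X.shape[0]
    # precompute K[i, j] = k(x^(i), x^(j)) once
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    beta = np.zeros(n)
    for _ in range(iters):
        for i in range(n):
            # beta_i := beta_i + alpha * (y^(i) - sum_j beta_j k(x^(i), x^(j)))
            beta[i] += alpha * (y[i] - K[i] @ beta)
    return beta

def predict(X, beta, x):
    # theta^T phi(x) = sum_i beta_i k(x^(i), x): n kernel evaluations, O(nd)
    return sum(beta[i] * k(X[i], x) for i in range(len(beta)))
```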
additional information
kernel smoothing: see kernel smoothing
some useful kernels
monomials For monomials of order up to t, we have:
\begin{equation} k\left(x,z\right) = \left(x^{T}z + c\right)^{t} \end{equation}
which corresponds to a feature map of all monomials x_{i}, x_{i}x_{j}, x_{i}x_{j}x_{k}, \ldots, giving a \binom{d+t}{t}-dimensional feature space. For \left(x^{T}z+c\right)^{2}, for instance, we have:
\begin{equation} \phi\left(x\right) = \mqty(c \\ \sqrt{2c}\, x_1 \\ \vdots \\ \sqrt{2c}\, x_{d} \\ x_1 x_1 \\ x_1 x_2 \\ \vdots \\ x_{d} x_{d}) \end{equation}
which you can derive by expanding (FOILing) the kernel function and matching coefficients to recover each term.
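To sanity-check that expansion, you can compare the explicit feature map for t = 2 against the kernel numerically; a sketch, where `phi_quadratic` is a made-up helper for this note:

```python
import numpy as np

def phi_quadratic(x, c=1.0):
    # explicit feature map for k(x, z) = (x^T z + c)^2:
    # [c, sqrt(2c) x_1, ..., sqrt(2c) x_d, x_i x_j for all i, j]
    d = len(x)
    cross = [x[i] * x[j] for i in range(d) for j in range(d)]
    return np.concatenate(([c], np.sqrt(2 * c) * x, cross))

rng = np.random.default_rng(0)
x, z = rng.normal(size=4), rng.normal(size=4)
lhs = phi_quadratic(x) @ phi_quadratic(z)   # <phi(x), phi(z)>
rhs = (x @ z + 1.0) ** 2                    # (x^T z + c)^2 with c = 1
assert np.isclose(lhs, rhs)
```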
gaussian kernel
see gaussian kernel, which has:
\begin{equation} k\left(x,z\right) = \exp \left(\frac{- \norm{x-z}^{2}_{2}}{2\sigma^{2}}\right) \end{equation}
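As a reference sketch (sigma is the bandwidth; the function name is just for this note):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||_2^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
```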
what makes a valid kernel? This is a necessary and sufficient condition:
Suppose there are n data points x^{(1)}, \ldots, x^{(n)}. Consider the kernel matrix K \in \mathbb{R}^{n \times n}, with entries K_{ij} = k\left(x^{(i)}, x^{(j)}\right). A valid kernel requires that K \succeq 0, namely that K is positive semi-definite.
Consider what that would mean:
\begin{align} z^{T} K z &= \sum_{i=1}^{n} \sum_{j=1}^{n} z_{i} K_{ij} z_{j} \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n} z_{i} \phi\left(x^{(i)}\right)^{T} \phi\left(x^{(j)}\right) z_{j} \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n} z_{i} \sum_{l} \phi_{l}\left(x^{(i)}\right) \phi_{l}\left(x^{(j)}\right) z_{j} \\ &= \sum_{l} \left(\sum_{i} z_{i}\phi_{l}\left(x^{(i)}\right)\right)^{2} \geq 0 \end{align}
This shows that K being PSD is a necessary condition for k being a valid kernel. What about sufficiency? We state the result without proof (proof by divine intervention):
k is a valid kernel if and only if, for all n < \infty and all x^{(1)}, \ldots, x^{(n)}, the corresponding kernel matrix K is symmetric positive semi-definite.
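One way to spot-check this condition numerically is to build K on a handful of points and confirm its eigenvalues are nonnegative up to floating-point error; a sketch, assuming a Gaussian kernel:

```python
import numpy as np

def kernel_matrix(X, kernel):
    # K[i, j] = kernel(x^(i), x^(j))
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = kernel_matrix(X, lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2))

# symmetric PSD: all eigenvalues >= 0 (up to numerical tolerance)
eigvals = np.linalg.eigvalsh(K)
assert np.all(eigvals >= -1e-9)
```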
deriving the kernel trick
Consider now the update rule for \beta from above:
\begin{equation} \beta_{i} = \beta_{i} + \alpha \left(y^{(i)} - \theta^{T} \phi\left(x^{(i)}\right)\right) \end{equation}
plugging in the definition of \theta in terms of \beta (see the motivation section below), we obtain:
\begin{align} \beta_{i} &= \beta_{i} + \alpha \left(y^{(i)} - \theta^{T} \phi\left(x^{(i)}\right)\right) \\ &= \beta_{i} + \alpha \left(y^{(i)} - \left(\sum_{j=1}^{n} \beta_{j} \phi\left(x^{(j)}\right)\right)^{T}\phi\left(x^{(i)}\right)\right) \\ &= \beta_{i} + \alpha \left(y^{(i)} - \sum_{j=1}^{n} \beta_{j}\left\langle \phi \left(x^{(j)}\right), \phi\left(x^{(i)}\right) \right\rangle\right) \end{align}
Which gives us two observations:
\langle \phi \left(x^{(i)}\right), \phi \left(x^{(j)}\right) \rangle, \forall i,j can be precomputed
computing these inner products can often be much faster than O\left(p\right)

efficiently computing the kernel \langle \phi \left(x^{(i)}\right), \phi \left(x^{(j)}\right) \rangle
Consider the following scheme to feature engineer your data:
\begin{equation} \phi \left(x\right) = \mqty[1 \\ x_1 \\ \dots \\ x_{d} \\ x_1 x_1 \\ x_1 x_2 \\ \dots \\ x_1x_1x_1 \\ x_1x_1x_2 \\ \dots \\ x_{d} x_{d} x_{d}] \end{equation}
How would we compute the inner product above? Naively, for two data samples x, z:
\begin{equation} \langle \phi\left(x\right), \phi \left(z\right) \rangle = 1 + \sum_{i=1}^{d} x_{i} z_{i} + \sum_{i=1}^{d} \sum_{j=1}^{d} x_{i}x_{j}z_{i}z_{j}+ \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} x_{i}x_{j}x_{k} z_{i}z_{j}z_{k} \end{equation}
You will notice:
\begin{equation} \langle \phi\left(x\right), \phi \left(z\right) \rangle = 1 + \sum_{i=1}^{d} x_{i} z_{i} + \underbrace{\sum_{i=1}^{d} \sum_{j=1}^{d} x_{i}x_{j}z_{i}z_{j}}_{\langle x,z \rangle^{2}}+ \underbrace{\sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} x_{i}x_{j}x_{k} z_{i}z_{j}z_{k}}_{\langle x,z \rangle^{3}} \end{equation}
So we can finally write:
\begin{equation} \langle \phi\left(x\right), \phi \left(z\right) \rangle = 1 + \langle x,z \rangle + \langle x,z \rangle^{2} + \langle x,z \rangle^{3} \end{equation}
which takes only O\left(d\right) time per kernel evaluation. As opposed to the naive O\left(np\right) cost per iteration for linear regression in the feature space, we have O\left(n^{2}\right) per iteration plus O\left(n^{2}d\right) for precomputing the inner products above. This is nice because p may be much bigger than n.
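A sketch of that precomputation, using the identity above so that each kernel evaluation is O(d) rather than O(p) (the helper names are made up for this note):

```python
import numpy as np

def k_fast(x, z):
    # <phi(x), phi(z)> = 1 + <x,z> + <x,z>^2 + <x,z>^3, computed in O(d)
    s = x @ z
    return 1 + s + s ** 2 + s ** 3

def precompute_K(X):
    # O(n^2 d) total: n^2 kernel evaluations, each O(d)
    n = X.shape[0]
    return np.array([[k_fast(X[i], X[j]) for j in range(n)] for i in range(n)])

def beta_sweep(K, y, beta, alpha=1e-3):
    # one sweep of the beta update; each row dot product K[i] @ beta is O(n)
    for i in range(len(beta)):
        beta[i] += alpha * (y[i] - K[i] @ beta)
    return beta
```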
motivation
Suppose you have some kind of non-linear (say, exponential) dataset… Consider now Linear Regression with a feature map:
\begin{equation} \theta := \theta + \alpha \sum_{i=1}^{n} \left(y^{(i)} - \theta^{T}\phi\left(x^{(i)}\right)\right) \phi \left(x^{(i)}\right) \end{equation}
NOTICE that \theta is a linear combination of the \phi\left(x^{(i)}\right)'s (it starts at zero and every update only adds multiples of them), which we can write as:
\begin{equation} \theta = \sum_{i=1}^{n} \beta_{i} \phi\left(x^{(i)}\right) \end{equation}
observe: this is n variables \beta_{1}, \ldots, \beta_{n} instead of p variables. This is nice because if p is large, working with the feature map \phi directly could get really expensive.
Substituting the above in for \theta, we obtain:
\begin{equation} \theta = \sum_{i=1}^{n} \underbrace{\left(\beta_{i} + \alpha\left(y^{(i)}-\theta^{T}\phi\left(x^{(i)}\right)\right)\right)}_{\text{new $\beta_i$}} \phi\left(x^{(i)}\right) \end{equation}
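As a small numerical check of this equivalence, you can run the batch update directly on \theta while tracking \beta with the bracketed update, and confirm that \theta = \sum_i \beta_i \phi(x^{(i)}) holds throughout; a sketch with a toy feature map (the names phi, Phi are just for this note):

```python
import numpy as np

def phi(x):
    # toy feature map: [1, x_1, x_2, x_1^2, x_2^2]
    return np.array([1.0, x[0], x[1], x[0] ** 2, x[1] ** 2])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
Phi = np.array([phi(x) for x in X])        # n x p design matrix in feature space

theta = np.zeros(Phi.shape[1])
beta = np.zeros(len(X))
alpha = 0.01

for _ in range(50):
    resid = y - Phi @ theta                # y^(i) - theta^T phi(x^(i))
    beta = beta + alpha * resid            # new beta_i, per the bracketed term
    theta = theta + alpha * Phi.T @ resid  # batch gradient update on theta

# theta is exactly the beta-weighted combination of the phi(x^(i))
assert np.allclose(theta, Phi.T @ beta)
```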