Kernels provide a bridge between linearity and non-linearity. A few key facts:
- If an algorithm can be expressed only in terms of an inner product between two vectors, all we need to do is replace that inner product with the inner product from some other suitable space.
- Kernel functions must be continuous, symmetric, and preferably should have a positive (semi-)definite Gram matrix.
- The use of a positive definite kernel ensures that the optimization problem will be convex and the solution will be unique (a quick numerical check of the last two points is sketched below).
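A minimal sketch of that check, using NumPy and a made-up random dataset (the points and bandwidth are arbitrary): the Gaussian kernel's Gram matrix comes out symmetric, with non-negative eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))   # 20 arbitrary 2-D points
sigma_sq = 1.0

# Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma_sq))

print(np.allclose(K, K.T))                    # True: symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True: positive semi-definite (up to tolerance)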
Take, for example, a model of the function
$$
y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2
$$
It can be written like this
$$ y = \theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 $$
where $f_1=x_1$, $f_2=x_2$, $f_3=x_1 x_2$
The features $f_i$ can also be calculated using a kernel, so every sample $x$ gets converted into a new set of values.
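As a tiny illustration (the coefficients and the sample point below are made up), hand-crafted features like these already turn a linear model into a nonlinear function of $x$:
# Made-up coefficients [theta_0, theta_1, theta_2, theta_3], for illustration only
theta = [0.5, 1.0, -2.0, 3.0]

def features(x1, x2):
    # f_0 = 1 (intercept), f_1 = x1, f_2 = x2, f_3 = x1 * x2
    return [1.0, x1, x2, x1 * x2]

def predict(x1, x2):
    return sum(t * f for t, f in zip(theta, features(x1, x2)))

print(predict(2.0, -1.0))   # 0.5 + 1.0*2 + (-2.0)*(-1) + 3.0*(2 * -1) = -1.5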
In [1]:
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
def gaussian_kernel(x, xp, sigma_sq):
    # Gaussian kernel exp(-||x - xp||^2 / (2 * sigma^2)), where sigma_sq = sigma^2
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(xp))**2 / (2 * sigma_sq))
landmark = np.array([3,2])
sigma_sq = 1
In [2]:
# Evaluate the kernel against the landmark over a grid and draw the contours
levels = 10
xlim = (1, 5)
ylim = (0, 4)
ngrid = 100
x = np.linspace(xlim[0], xlim[1], ngrid, endpoint=True)
y = np.linspace(ylim[0], ylim[1], ngrid, endpoint=True)
X, Y = np.meshgrid(x, y)
xy = np.vstack([X.ravel(), Y.ravel()]).T
z = np.array([gaussian_kernel(xy_elem, landmark, sigma_sq) for xy_elem in xy])
z = z.reshape((ngrid, ngrid))
plt.contourf(x, y, z, levels, cmap=plt.cm.jet)
plt.show()
Depending on how close a sample is to the landmark, it gets replaced by a new value (the kernel output, which lies between 0 and 1).
In [3]:
gaussian_kernel([3,3],landmark,sigma_sq)
Out[3]:
0.6065306597126334
In [4]:
gaussian_kernel([2,4],landmark,sigma_sq)
Out[4]:
0.0820849986238988
A kernel acts like a measure of similarity!
In [5]:
from scipy.io import loadmat
import pandas as pd
raw_data = loadmat('ex6data2.mat')   # dataset from exercise 6 of the Coursera ML course
data = pd.DataFrame(raw_data['X'], columns=['X1', 'X2'])
data['y'] = raw_data['y']
In [6]:
positive = data[data['y'].isin([1])]
negative = data[data['y'].isin([0])]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(positive['X1'], positive['X2'], s=5, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=5, marker='o', label='Negative')
ax.legend()
Out[6]:
How to find a decision boundary for this one?
If I define 4 landmarks, we may be able to find one, because there could be a function
$$ y = \theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \theta_4 f_4 $$
that predicts "Positive" when a sample is far from those landmarks, using a Gaussian kernel to compute each $f_i$.
It could be as simple as $y = f_1 + f_2 + f_3 + f_4$: the closer $y$ is to 0, the more likely the sample is "Positive" (see the quick check after the plot below).
In [7]:
fig = plt.figure()
ax = fig.add_subplot(111)
# Four hand-picked landmarks (a, b, c, d) overlaid on the data
a = [0.3, 0.75]
b = [0.4, 0.6]
c = [0.6, 0.75]
d = [0.8, 0.6]
ax.plot(a[0], a[1], marker='.', markersize=30, color="black")
ax.plot(b[0], b[1], marker='.', markersize=30, color="black")
ax.plot(c[0], c[1], marker='.', markersize=30, color="black")
ax.plot(d[0], d[1], marker='.', markersize=30, color="black")
ax.scatter(positive['X1'], positive['X2'], s=5, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=5, marker='o', label='Negative')
ax.legend()
Out[7]:
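A rough check of that idea (not a fitted model), reusing the gaussian_kernel function and the landmarks a, b, c, d defined above; sigma_sq = 0.01 is a made-up bandwidth chosen for the ~0.1 scale of this dataset:
landmarks = [a, b, c, d]

def landmark_features(x, sigma_sq=0.01):
    # f_j = k(x, l^(j)) for each landmark l^(j)
    return np.array([gaussian_kernel(np.array(x), np.array(l), sigma_sq) for l in landmarks])

near = landmark_features([0.45, 0.65])   # point inside the landmark cluster -> large f_j
far = landmark_features([0.10, 0.10])    # point far from every landmark -> f_j near 0
print(near.sum(), far.sum())             # a sum near 0 suggests "Positive" under the rule above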
Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$
Choose $l^{(1)} = x^{(1)}, \ldots, l^{(m)} = x^{(m)}$
For a training example $(x^{(i)}, y^{(i)})$:
$$ \begin{align} f^{(i)} &= \begin{bmatrix} k(x^{(i)}, l^{(1)}) \\ k(x^{(i)}, l^{(2)}) \\ \vdots \\ k(x^{(i)}, l^{(m)}) \end{bmatrix} \end{align} $$
So, this minimization from the Coursera course might work if I implement it:
$$ \min_{\theta}\; C\sum_{i=1}^m \bigg[y^{(i)}\,\text{cost}_1(\theta^T f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T f^{(i)})\bigg] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2 $$
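Rather than implementing that exact cost function, here is a close stand-in: build the feature vectors $f^{(i)}$ with every training example as a landmark and fit a plain linear classifier on them with scikit-learn (the bandwidth sigma_sq_small is an assumption):
import numpy as np
from sklearn.svm import LinearSVC

X = data[['X1', 'X2']].values
y = data['y'].values
sigma_sq_small = 0.01                    # assumed kernel bandwidth for this dataset

# F[i, j] = k(x^(i), l^(j)) with landmarks l^(j) = x^(j)
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
F = np.exp(-sq_dists / (2 * sigma_sq_small))

clf = LinearSVC(C=100, max_iter=20000)   # linear model trained on the kernel features
clf.fit(F, y)
print((clf.predict(F) == y).mean())      # training accuracy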
Another way to see the problem
In http://cs229.stanford.edu/notes/cs229-notes3.pdf, a more involved algorithm for training an SVM is derived. Solving its dual problem gives an expression for the weights (the result there is $w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$), which is used for finding $w$ in
$$ y = w^T x + b $$
Both the dual objective and the resulting classifier can be written entirely in terms of the inner product $\langle x^{(i)}, x^{(j)}\rangle = (x^{(i)})^T x^{(j)}$.
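For instance, substituting that expression for $w$ into the decision function shows that predictions only need inner products; this is the step the kernel trick exploits:
$$ w^T x + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b $$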
The kernel trick
There's a problem with this SVM: it performs very badly on datasets that are not linearly separable, like the one in the example above. The kernel trick simply consists of replacing $\langle x^{(i)}, x^{(j)}\rangle$ with a kernel function, for example:
Linear kernel
$$ k(x^{(i)}, x^{(j)}) = (x^{(i)})^T x^{(j)} $$
Gaussian kernel
$$ k(x^{(i)}, x^{(j)}) = \exp\Big(-\frac{||x^{(i)} - x^{(j)}||^2}{2 \sigma^2}\Big) $$
Basically, any learning algorithm that you can write in terms of only inner products between input attribute vectors can take advantage of the kernel trick.
Mapping function
Another kernel can be built from a mapping function:
$$ k(x, x') = \phi(x)^T\phi(x') $$
where
$$ \phi(x) = [\phi_1(x), \ldots, \phi_H(x)]^T $$
The mapping can be a polynomial feature vector with $H$ components, such as $\phi(x) = [1, x, x^2, \ldots, x^{H-1}]^T$, which is commonly used in nonlinear regression.
Or it can simply be $\phi(x) = x$, in which case the result is the plain linear kernel.
There's also the "design matrix" $\Phi$, whose rows are the mapped inputs $\phi(x^{(i)})^T$, where $N$ is the number of input data points $x^{(i)}$, so $\Phi$ has shape $N \times H$.
These mapping functions are not the same thing as kernels, but I put them here because they kept me confused for a while.
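A small sketch of such a mapping and its design matrix (the value of $H$ and the sample points are made up):
import numpy as np

H = 4
def phi(x):
    # Polynomial mapping phi(x) = [1, x, x^2, ..., x^(H-1)]
    return np.array([x**h for h in range(H)])

x_pts = np.array([0.5, 1.0, 2.0])          # N = 3 input data points
Phi = np.vstack([phi(x) for x in x_pts])   # design matrix: one row phi(x^(i))^T per point, shape (N, H)

# The corresponding kernel is just the inner product of the mapped inputs
print(phi(0.5) @ phi(2.0))                 # k(0.5, 2.0) = phi(0.5)^T phi(2.0) = 4.0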
In [8]:
from sklearn import svm
# SVC uses the RBF (Gaussian) kernel by default; gamma plays the role of 1/(2*sigma^2)
svc = svm.SVC(C=100, gamma=10, probability=True)
svc.fit(data[['X1', 'X2']], data['y'])
data['y_estimated'] = svc.predict(data[['X1', 'X2']])
# Color each sample by its predicted class to visualize the learned decision regions
plt.scatter(data['X1'], data['X2'], s=5, c=data['y_estimated'])
Out[8]:
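As a follow-up, since probability=True was set we can also look at the training accuracy and at class probabilities for a query point (the coordinates below are arbitrary):
print((data['y_estimated'] == data['y']).mean())   # fraction of training samples recovered
print(svc.predict_proba([[0.4, 0.65]]))            # class probabilities for an arbitrary point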