Kernels provide a bridge between linearity and non-linearity. A few key facts:
- If an algorithm can be expressed only in terms of an inner product between two vectors, all we need to do is replace that inner product with the inner product from some other suitable space.
- Kernel functions must be continuous, symmetric, and preferably should have a positive (semi-)definite Gram matrix.
- The use of a positive definite kernel ensures that the optimization problem will be convex and the solution will be unique (a quick numerical check of the last two points is sketched below).
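A minimal sketch of that check, using NumPy and a made-up random dataset (the points and bandwidth are arbitrary): the Gaussian kernel's Gram matrix comes out symmetric, with non-negative eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))   # 20 arbitrary 2-D points
sigma_sq = 1.0

# Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma_sq))

print(np.allclose(K, K.T))                    # True: symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True: positive semi-definite (up to tolerance)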
Take, for example, a model of the function
$$
y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2
$$
It can be written like this
$$ y = \theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 $$
where $f_1=x_1$, $f_2=x_2$, $f_3=x_1 x_2$
The features $f_i$ can also be calculated using a kernel, so every sample $x$ gets converted into a new set of values.
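As a tiny illustration (the coefficients and the sample point below are made up), hand-crafted features like these already turn a linear model into a nonlinear function of $x$:
# Made-up coefficients [theta_0, theta_1, theta_2, theta_3], for illustration only
theta = [0.5, 1.0, -2.0, 3.0]

def features(x1, x2):
    # f_0 = 1 (intercept), f_1 = x1, f_2 = x2, f_3 = x1 * x2
    return [1.0, x1, x2, x1 * x2]

def predict(x1, x2):
    return sum(t * f for t, f in zip(theta, features(x1, x2)))

print(predict(2.0, -1.0))   # 0.5 + 1.0*2 + (-2.0)*(-1) + 3.0*(2 * -1) = -1.5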
In [1]:
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
def gaussian_kernel(x, xp, sigma_sq):
    # Gaussian kernel exp(-||x - xp||^2 / (2 * sigma^2)), where sigma_sq = sigma^2
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(xp))**2 / (2 * sigma_sq))
landmark = np.array([3,2])
sigma_sq = 1
In [2]:
# Evaluate the kernel against the landmark over a grid and draw the contours
levels = 10
xlim = (1, 5)
ylim = (0, 4)
ngrid = 100
x = np.linspace(xlim[0], xlim[1], ngrid, endpoint=True)
y = np.linspace(ylim[0], ylim[1], ngrid, endpoint=True)
X, Y = np.meshgrid(x, y)
xy = np.vstack([X.ravel(), Y.ravel()]).T
z = np.array([gaussian_kernel(xy_elem, landmark, sigma_sq) for xy_elem in xy])
z = z.reshape((ngrid, ngrid))
plt.contourf(x, y, z, levels, cmap=plt.cm.jet)
plt.show()
Depending on how close a sample is to the landmark, it gets replaced by a new value (the kernel output, which lies between 0 and 1).
In [3]:
gaussian_kernel([3,3],landmark,sigma_sq)
Out[3]:
0.6065306597126334
In [4]:
gaussian_kernel([2,4],landmark,sigma_sq)
Out[4]:
0.0820849986238988
A kernel acts like a measure of similarity!
In [5]:
from scipy.io import loadmat
import pandas as pd
raw_data = loadmat('ex6data2.mat')   # dataset from exercise 6 of the Coursera ML course
data = pd.DataFrame(raw_data['X'], columns=['X1', 'X2'])
data['y'] = raw_data['y']
In [6]:
positive = data[data['y'].isin([1])]
negative = data[data['y'].isin([0])]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(positive['X1'], positive['X2'], s=5, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=5, marker='o', label='Negative')
ax.legend()
Out[6]:
How to find a decision boundary for this one?
If I define 4 landmarks, we may be able to find one, because there could be a function
$$ y = \theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \theta_4 f_4 $$
that predicts "Positive" when a sample is far from those landmarks, using a Gaussian kernel to compute each $f_i$.
It could be as simple as $y = f_1 + f_2 + f_3 + f_4$: the closer $y$ is to 0, the more likely the sample is "Positive" (see the quick check after the plot below).
In [7]:
fig = plt.figure()
ax = fig.add_subplot(111)
# Four hand-picked landmarks (a, b, c, d) overlaid on the data
a = [0.3, 0.75]
b = [0.4, 0.6]
c = [0.6, 0.75]
d = [0.8, 0.6]
ax.plot(a[0], a[1], marker='.', markersize=30, color="black")
ax.plot(b[0], b[1], marker='.', markersize=30, color="black")
ax.plot(c[0], c[1], marker='.', markersize=30, color="black")
ax.plot(d[0], d[1], marker='.', markersize=30, color="black")
ax.scatter(positive['X1'], positive['X2'], s=5, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=5, marker='o', label='Negative')
ax.legend()
Out[7]:
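A rough check of that idea (not a fitted model), reusing the gaussian_kernel function and the landmarks a, b, c, d defined above; sigma_sq = 0.01 is a made-up bandwidth chosen for the ~0.1 scale of this dataset:
landmarks = [a, b, c, d]

def landmark_features(x, sigma_sq=0.01):
    # f_j = k(x, l^(j)) for each landmark l^(j)
    return np.array([gaussian_kernel(np.array(x), np.array(l), sigma_sq) for l in landmarks])

near = landmark_features([0.45, 0.65])   # point inside the landmark cluster -> large f_j
far = landmark_features([0.10, 0.10])    # point far from every landmark -> f_j near 0
print(near.sum(), far.sum())             # a sum near 0 suggests "Positive" under the rule above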
Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$
Choose $l^{(1)} = x^{(1)}, \ldots, l^{(m)} = x^{(m)}$
For a training example $(x^{(i)}, y^{(i)})$:
$$ \begin{align} f^{(i)} &= \begin{bmatrix} k(x^{(i)}, l^{(1)}) \\ k(x^{(i)}, l^{(2)}) \\ \vdots \\ k(x^{(i)}, l^{(m)}) \end{bmatrix} \end{align} $$
So, this minimization from the Coursera course might work if I implement it:
$$ \min_{\theta}\; C\sum_{i=1}^m \bigg[y^{(i)}\,\text{cost}_1(\theta^T f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T f^{(i)})\bigg] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2 $$
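Rather than implementing that exact cost function, here is a close stand-in: build the feature vectors $f^{(i)}$ with every training example as a landmark and fit a plain linear classifier on them with scikit-learn (the bandwidth sigma_sq_small is an assumption):
import numpy as np
from sklearn.svm import LinearSVC

X = data[['X1', 'X2']].values
y = data['y'].values
sigma_sq_small = 0.01                    # assumed kernel bandwidth for this dataset

# F[i, j] = k(x^(i), l^(j)) with landmarks l^(j) = x^(j)
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
F = np.exp(-sq_dists / (2 * sigma_sq_small))

clf = LinearSVC(C=100, max_iter=20000)   # linear model trained on the kernel features
clf.fit(F, y)
print((clf.predict(F) == y).mean())      # training accuracy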
Another way to see the problem
In http://cs229.stanford.edu/notes/cs229-notes3.pdf, a more involved algorithm for training an SVM is derived. Solving its dual problem gives an expression for the weights (the result there is $w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$), which is used for finding $w$ in
$$ y = w^T x + b $$
Both the dual objective and the resulting classifier can be written entirely in terms of the inner product $\langle x^{(i)}, x^{(j)}\rangle = (x^{(i)})^T x^{(j)}$.
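For instance, substituting that expression for $w$ into the decision function shows that predictions only need inner products; this is the step the kernel trick exploits:
$$ w^T x + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b $$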
The kernel trick
There's a problem with this SVM: it performs very badly on datasets that are not linearly separable, like the one in the example above. The kernel trick simply consists of replacing $\langle x^{(i)}, x^{(j)}\rangle$ with a kernel function, for example:
Linear kernel
$$ k(x^{(i)}, x^{(j)}) = (x^{(i)})^T x^{(j)} $$
Gaussian kernel
$$ k(x^{(i)}, x^{(j)}) = \exp\Big(-\frac{||x^{(i)} - x^{(j)}||^2}{2 \sigma^2}\Big) $$
Basically, any learning algorithm that you can write in terms of only inner products between input attribute vectors can take advantage of the kernel trick.
Mapping function
Another kernel can be built from a mapping function:
$$ k(x, x') = \phi(x)^T\phi(x') $$
where
$$ \phi(x) = [\phi_1(x), \ldots, \phi_H(x)]^T $$
The mapping can be a polynomial feature vector with $H$ components, such as $\phi(x) = [1, x, x^2, \ldots, x^{H-1}]^T$, which is commonly used in nonlinear regression.
Or it can simply be $\phi(x) = x$, in which case the result is the plain linear kernel.
There's also the "design matrix" $\Phi$, whose rows are the mapped inputs $\phi(x^{(i)})^T$, where $N$ is the number of input data points $x^{(i)}$, so $\Phi$ has shape $N \times H$.
These mapping functions are not the same thing as kernels, but I put them here because they kept me confused for a while.
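A small sketch of such a mapping and its design matrix (the value of $H$ and the sample points are made up):
import numpy as np

H = 4
def phi(x):
    # Polynomial mapping phi(x) = [1, x, x^2, ..., x^(H-1)]
    return np.array([x**h for h in range(H)])

x_pts = np.array([0.5, 1.0, 2.0])          # N = 3 input data points
Phi = np.vstack([phi(x) for x in x_pts])   # design matrix: one row phi(x^(i))^T per point, shape (N, H)

# The corresponding kernel is just the inner product of the mapped inputs
print(phi(0.5) @ phi(2.0))                 # k(0.5, 2.0) = phi(0.5)^T phi(2.0) = 4.0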
In [8]:
from sklearn import svm
# SVC uses the RBF (Gaussian) kernel by default; gamma plays the role of 1/(2*sigma^2)
svc = svm.SVC(C=100, gamma=10, probability=True)
svc.fit(data[['X1', 'X2']], data['y'])
data['y_estimated'] = svc.predict(data[['X1', 'X2']])
# Color each sample by its predicted class to visualize the learned decision regions
plt.scatter(data['X1'], data['X2'], s=5, c=data['y_estimated'])
Out[8]:
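As a follow-up, since probability=True was set we can also look at the training accuracy and at class probabilities for a query point (the coordinates below are arbitrary):
print((data['y_estimated'] == data['y']).mean())   # fraction of training samples recovered
print(svc.predict_proba([[0.4, 0.65]]))            # class probabilities for an arbitrary point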