The explanation of Gaussian Processes in the CS229 notes is the best one I have found and understood:
http://cs229.stanford.edu/section/cs229-gaussian_processes.pdf
Here, I'll be covering the main points behind this method, and I'll be using code from
https://blog.dominodatalab.com/fitting-gaussian-process-models-python/
whose author happens to have many great repositories of Jupyter notebooks on ML:
https://github.com/fonnesbeck?tab=repositories
Notation
A training set of i.i.d. examples drawn from some unknown distribution: $$ S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m} $$
A testing set of i.i.d. examples drawn from the same distribution as $S$: $$ T = \{(x^{(i)}_*, y^{(i)}_*)\}_{i=1}^{m_*} $$
In standard linear regression, we would fit the model $$ \begin{align} y^{(i)} = w^T x^{(i)} + \epsilon^{(i)} \\ \epsilon^{(i)} \sim \mathcal N (0, \sigma^2) \quad \textrm{i.i.d.} \end{align} $$
But in a Gaussian Process (GP), the model will look like this $$ y^{(i)} = f(x^{(i)}) + \epsilon^{(i)} $$
or in matrix form $$ \vec{y} = f(X) + \vec{\epsilon} $$
because the point of a GP is to find a distribution over the possible functions $f$ that are consistent with the observed data.
Partitioning
There is a common theorem: a random vector $\vec{x} \in \mathbb R^n$ with $\vec{x} \sim \mathcal N (\vec{\mu}, \Sigma)$ can be partitioned into two subvectors $\vec{x}_1 \in \mathbb R^p$ and $\vec{x}_2\in \mathbb R^q$ with $p+q = n$:
$$ \vec{x} = \begin{bmatrix} \vec{x}_1 \\ \vec{x}_2 \end{bmatrix} $$
such that
$$ \vec{\mu}=\begin{bmatrix} \vec{\mu}_1 \\ \vec{\mu}_2 \end{bmatrix} $$
$$ \Sigma=\begin{bmatrix} \Sigma_{\vec{x}_1\vec{x}_1} & \Sigma_{\vec{x}_1\vec{x}_2}\\ \Sigma_{\vec{x}_2\vec{x}_1} & \Sigma_{\vec{x}_2\vec{x}_2} \end{bmatrix}=\begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}=\begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{12}^T & \Sigma_{22} \end{bmatrix} $$
Because $\Sigma$ is symmetric, $\Sigma = \Sigma^T$ and $\Sigma_{21} = \Sigma_{12}^T$
In other words, there is this joint distribution
$$ \begin{bmatrix} \vec{x}_1 \\ \vec{x}_2 \end{bmatrix} \sim \mathcal{N}{\left( \begin{bmatrix} \vec{\mu}_1 \\ \vec{\mu}_2 \end{bmatrix} , \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{12}^T & \Sigma_{22} \end{bmatrix} \right)} $$
For that case, the following properties hold:
- Normalization: the probabilities sum (integrate) to 1
- Marginalization: The marginal distributions of $x_1$ and $x_2$ are Gaussian
- Conditioning: The conditional distribution of $\vec{x}_i$ given $\vec{x}_j$ is also Gaussian, with $$ \begin{align} \vec{\mu}_{i|j} = \vec{\mu}_i + \Sigma_{ij}\Sigma_{jj}^{-1}(\vec{x}_j - \vec{\mu}_j)\\ \Sigma_{i|j} = \Sigma_{ii} - \Sigma_{ij}\Sigma_{jj}^{-1}\Sigma_{ji} \end{align} $$
The derivation for this is here http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node7.html
(The part about $\Sigma_{i|j}$ in that derivation isn't quite right, though the rest is fine.)
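Just to convince myself, here is a small sketch of my own (not from the notes) that checks the conditioning formulas numerically against samples from a 2-D joint Gaussian:
import numpy as np

# A 2-D joint Gaussian, partitioned into x1 (first coordinate) and x2 (second)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Analytic conditional of x1 given x2 = 0.5, using the formulas above
x2 = 0.5
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])        # 2.2
var_cond = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]  # 1.36

# Monte Carlo check: keep joint samples whose x2 falls near 0.5
samples = np.random.multivariate_normal(mu, Sigma, size=500000)
near = samples[np.abs(samples[:, 1] - x2) < 0.01, 0]
print(mu_cond, var_cond)        # analytic conditional mean and variance
print(near.mean(), near.var())  # empirical estimates, should be close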
Kernel Function
Given a kernel function $k$, the kernel matrix between two sets of points $X$ and $X'$ is
$$ K_{XX'} = K(X,X') = \begin{pmatrix} k(x_1, x_1') & \cdots & k(x_1, x_m') \\ \vdots & \ddots & \vdots \\ k(x_m, x_1') & \cdots & k(x_m, x_m') \end{pmatrix} $$
An example is the squared exponential kernel, a.k.a. the Gaussian kernel, a.k.a. the RBF kernel:
$$ k_{\textrm{SE}}(x, x') = \sigma^2\exp\left(-\frac{(x - x')^2}{2\ell^2}\right) $$
In [1]:
import numpy as np

def exponential_cov(x, y, params):
    # Squared exponential kernel: params[0] is sigma, params[1] is the length scale ell
    return params[0]**2 * np.exp(-np.subtract.outer(x, y)**2 / (2 * params[1]**2))

θ = [1, 1]
exponential_cov([1, 2], [1, 2], θ)
Out[1]:
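If everything is wired up correctly, this should return a 2×2 matrix with ones on the diagonal and $\exp(-1/2) \approx 0.61$ off the diagonal, i.e. the covariance this kernel assigns to the pair of points 1 and 2.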
Joint Distribution
From here on, I'll drop the $\vec{}$ notation for vectors. The partitioning theorem above is useful here because we assume there is a partition that gives us $f$ (the function values at the training inputs $X$) and $f_*$ (the function values at the test inputs $X_*$).
Consequently, $f$ and $f_*$ each have their own mean and covariance, just like in the theorem.
$$ \begin{align} f \sim \mathcal N (\mu, \Sigma)\\ f_* \sim \mathcal N (\mu_*, \Sigma_{**}) \end{align} $$
However, $\Sigma$ is written as $K$ (for kernel), because it is computed with a kernel function.
$$ \begin{align} f \sim \mathcal N (\mu, K)\\ f_* \sim \mathcal N (\mu_{*}, K_{**}) \end{align} $$
$$ \begin{pmatrix} f \\ f_{*} \end{pmatrix} \sim \mathcal{N}{\left( \begin{pmatrix} \mu \\ \mu_{*} \end{pmatrix} , \begin{pmatrix} K & K_{*}\\ K_{*}^T & K_{**}\\ \end{pmatrix} \right)} $$
Where
$ f, \mu \in \mathbb R^m $
$ f_*, \mu_* \in \mathbb R^{m_*} $
$ K = K(X,X)$ is $m\times m $
$ K_* = K(X,X_*)$ is $m\times m_* $
$ K_{**} = K(X_*,X_*)$ is $m_*\times m_* $
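To make the block structure concrete, here is a small sketch of mine (with made-up training and test inputs) that assembles the joint covariance using the exponential_cov function defined above:
# Made-up inputs: m = 3 training points, m_* = 2 test points
X = np.array([1.0, 2.0, 3.0])
Xs = np.array([1.5, 2.5])
params = [1, 1]  # sigma, length scale

K = exponential_cov(X, X, params)       # m x m
K_s = exponential_cov(X, Xs, params)    # m x m_*
K_ss = exponential_cov(Xs, Xs, params)  # m_* x m_*

# Joint covariance of (f, f_*), matching the block matrix above
joint_cov = np.block([[K, K_s],
                      [K_s.T, K_ss]])
print(joint_cov.shape)  # (5, 5)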
Posterior
Applying the conditioning property of the theorem $$ \begin{align} \vec{\mu}_{1|2} = \vec{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\vec{x}_2 - \vec{\mu}_2)\\ \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \end{align} $$
with $x_1 = f_*$ and $x_2 = f$ (so that $\Sigma_{12} = K_*^T$ and $\Sigma_{22} = K$), the posterior $p(f_* | f) = p(f_* | f, X_*, X)$ is
$$ p(f_* | f, X_*, X) = \mathcal N(\mu^*, \Sigma^*) $$
$$ \begin{align} \mu^* = \mu_* +K_{*}^T K^{-1}(f - \mu)\\ \Sigma^* = K_{**} - K_{*}^T K^{-1} K_{*} \end{align} $$
Gaussian Process (GP)
It is denoted using the notation $$ f(x) \sim \mathcal{GP}(m(x), k(x,x')) $$
or like this, since the mean and covariance functions take their own parameters:
$$ f(x) \sim \mathcal{GP}(m(x|\theta_m), k(x,x'|\theta_c)) $$
This defines a distribution over all possible functions $f(x)$ (which in practice has to be constrained, because there are far too many such functions to consider them all).
In [2]:
import matplotlib.pylab as plt
import seaborn as sns
%matplotlib inline
In [3]:
def plot_gp(ax, kernel, plot_xs, params, n_samples=10, xlab=False, ylab=False):
    # Draw n_samples functions from the zero-mean GP prior defined by the kernel
    cov = kernel(plot_xs, plot_xs, params)
    sampled_funcs = np.random.multivariate_normal(np.zeros(len(plot_xs)), cov, size=n_samples)
    ax.plot(plot_xs, sampled_funcs.T)
    ax.set_title(r'$\sigma = {},\/ \ell = {} $'.format(params[0], params[1]), fontsize=22)
    if xlab:
        ax.set_xlabel(r'$x$', fontsize=20)
    if ylab:
        ax.set_ylabel(r'$f(x)$', fontsize=20)

fig = plt.figure(figsize=(20, 8), facecolor='white')
axes = [fig.add_subplot(2, 3, i + 1, frameon=False) for i in range(6)]
for ax in axes:
    ax.set_xticks([])
    ax.set_yticks([])

# Same sigma = 1 in every panel, with an increasing length scale ell
plot_xs = np.linspace(-5, 5, 300)
for ax, ell in zip(axes, [0.05, 0.1, 0.5, 1, 3, 16]):
    plot_gp(ax, exponential_cov, plot_xs, [1, ell], 3, ylab=True)
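Each panel shows three functions sampled from the zero-mean GP prior with $\sigma = 1$ and the length scale $\ell$ given in its title: the larger $\ell$ is, the smoother and more slowly varying the sampled functions become.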
The idea of prediction with Gaussian Processes boils down to
"Given my old x, and the y values for those x, what do I expect y to be for a new x?".
We are then given the following information:
x = [1, 2, 3, 5, 7, 10]
y = [-0.3, 0.3, 0.5, -1.2, 0.7, -0.5]
Now, what is your best guess for $y$ when $x = 6$?
My best guess would be $y = 0$
To guess that with a GP, we have to implement
$$ p(f_* | f, X_*, X) = \mathcal N(\mu^*, \Sigma^*) $$
$$ \begin{align} \mu^* = \mu_* +K_{*}^T K^{-1}(f - \mu)\\ \Sigma^* = K_{**} - K_{*}^T K^{-1} K_{*} \end{align} $$
where
$ K = K(X,X)$ is $m\times m $
$ K_* = K(X,X_*)$ is $m\times m_* $
$ K_{**} = K(X_*,X_*)$ is $m_*\times m_* $
And I'll use the prior
$$ f(x) \sim \mathcal{GP}(m(x) = 0, k(x,x')) $$
Meaning that
$$ \begin{align} f \sim \mathcal N (\mu = \mathbf0_{m}, K)\\ f_* \sim \mathcal N (\mu_{*} = \mathbf0_{m_*}, K_{**}) \end{align} $$
Because that is the most common prior, the posterior is usually written in this form:
$$ p(f_* | f, X_*, X) = \mathcal N(\mu^*, \Sigma^*) $$
$$ \begin{align} \mu^* = K_{*}^T K^{-1}f\\ \Sigma^* = K_{**} - K_{*}^T K^{-1} K_{*} \end{align} $$
or
$$ p(y_* | y, X_*, X) = \mathcal N(K_{*}^T K^{-1}y, K_{**} - K_{*}^T K^{-1} K_{*}) $$
In [4]:
x = np.array([1, 2, 3, 5, 7, 10])
y = [-0.3, 0.3, 0.5, -1.2, 0.7, -0.5]
In [5]:
def posterior(x_new, x, y, kernel, params):
    # Covariance blocks: K(X, X), K(X, X_*), K(X_*, X_*)
    k = kernel(x, x, params)
    k_s = kernel(x, x_new, params)
    k_ss = kernel(x_new, x_new, params)
    k_inv = np.linalg.pinv(k)
    # Conditioning formulas with a zero-mean prior:
    # mu* = K_*^T K^{-1} y,   Sigma* = K_** - K_*^T K^{-1} K_*
    post_mean = k_s.T.dot(k_inv).dot(y)
    post_variance = k_ss - k_s.T.dot(k_inv).dot(k_s)
    return post_mean, post_variance
Predicting a value
In [6]:
params = [1, 1]
x_new = [6]
mean, variance = posterior(x_new, x, y, exponential_cov, params)
mean, variance
Out[6]:
The posterior mean is approximately the value we guessed for $y$ at $x = 6$ above.
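As a quick sanity check of my own (not in the original post), the posterior variance can be turned into a rough 95% interval around that mean:
std = np.sqrt(variance.diagonal())  # posterior standard deviation at x = 6
lower, upper = mean - 1.96 * std, mean + 1.96 * std
print(lower, upper)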
In [7]:
x_pts = np.arange(x.min()-5, x.max()+5, step=0.01)
mean, variance = posterior(x_pts, x, y, exponential_cov, params)
y_pts = np.random.multivariate_normal(mean, variance)
In [8]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(x_pts, y_pts, yerr=variance.diagonal(), capsize=0)
ax.scatter(x, y, c='red')
Out[8]:
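In the resulting plot, the red points are the six observations, the curve is one function sampled from the posterior over the grid, and the error bars show the pointwise posterior variance. The sampled function should pass close to the data, with the uncertainty growing as we move away from the observed points.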
In [9]:
np.array([x]).T
Out[9]:
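scikit-learn estimators expect their inputs as a 2-D array of shape (n_samples, n_features), which is why the 1-D x is reshaped into a column vector before fitting below.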
In [10]:
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import RBF
kernel = RBF()
gp = gaussian_process.GaussianProcessRegressor(kernel=kernel)
gp.fit(np.array([x]).T, y)
Out[10]:
In [11]:
mean, variance = gp.predict(np.array([x_pts]).T, return_cov=True)
In [12]:
y_pts = np.random.multivariate_normal(mean, variance)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(x_pts, y_pts, yerr=variance.diagonal(), capsize=0)
ax.scatter(x, y, c='red')
Out[12]:
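One caveat (my note, not from the original post): by default, GaussianProcessRegressor maximizes the log marginal likelihood over the kernel hyperparameters during fit, so its posterior won't necessarily match the hand-rolled posterior() with params = [1, 1]. A sketch of how the two could be made comparable by fixing the hyperparameters:
# Fix the RBF length scale at 1 and disable hyperparameter optimization, so the
# sklearn posterior should closely match the manual posterior() above (sigma = 1, ell = 1)
fixed_kernel = RBF(length_scale=1.0, length_scale_bounds="fixed")
gp_fixed = gaussian_process.GaussianProcessRegressor(kernel=fixed_kernel, optimizer=None)
gp_fixed.fit(np.array([x]).T, y)
mean_fixed, cov_fixed = gp_fixed.predict(np.array([x_pts]).T, return_cov=True)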