
Gaussian Processes Explained

The explanation of Gaussian Processes in the CS229 notes is the best one I have found and understood:
http://cs229.stanford.edu/section/cs229-gaussian_processes.pdf
Here I'll cover the main points behind this method, and I'll use code from
https://blog.dominodatalab.com/fitting-gaussian-process-models-python/
whose author happens to have many great repositories of Jupyter notebooks on ML:
https://github.com/fonnesbeck?tab=repositories

Notation

A training set of i.i.d. examples drawn from some unknown distribution: S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}

A testing set of i.i.d. examples drawn from the same distribution as S: T = \{(x^{(i)}_*, y^{(i)}_*)\}_{i=1}^{m_*}

Normally (for example, in linear regression) we would fit a model like this \begin{align} y^{(i)} = w^T x^{(i)} + \epsilon^{(i)} \\ \epsilon^{(i)} \sim \mathcal N (0, \sigma^2) \quad \text{i.i.d.} \end{align}

But with a Gaussian Process (GP), the model looks like this: y^{(i)} = f(x^{(i)}) + \epsilon^{(i)}

or, in matrix form, \vec{y} = f(X) + \vec{\epsilon}

because the point of a GP is to find a distribution over the possible functions f that are consistent with the observed data.
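To make the notation concrete, here is a minimal sketch of how noisy observations could be generated under y^{(i)} = f(x^{(i)}) + \epsilon^{(i)}; the latent function (a sine) and the noise level are made up purely for illustration, since in reality f is unknown.
import numpy as np

def f_true(x):
    # hypothetical latent function -- in reality f is unknown
    return np.sin(x)

rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 20)
noise_sd = 0.1
y_demo = f_true(x_demo) + rng.normal(0, noise_sd, size=x_demo.shape)  # y = f(x) + eps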

Data

When modeling f as a GP, we have:
Training set: S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}, with inputs X and function values f = f(X)
Testing set: T = \{(x^{(i)}_*, y^{(i)}_*)\}_{i=1}^{m_*}, with inputs X_* and function values f_* = f(X_*)
We will basically only use X and f, which are observed data, and X_* and f_*, which are unobserved data.
In the end, the goal is to estimate the outputs f_* for a set of new inputs X_*.

Partitioning

There is a standard theorem about partitioned Gaussians:
A random vector \vec{x} \in \mathbb R^n with \vec{x} \sim \mathcal N (\vec{\mu}, \Sigma) can be partitioned into two subvectors \vec{x}_1 \in \mathbb R^p and \vec{x}_2 \in \mathbb R^q with p+q = n,
\vec{x} = \begin{bmatrix} \vec{x}_1 \\ \vec{x}_2 \end{bmatrix}

such that
\vec{\mu}=\begin{bmatrix} \vec{\mu}_1 \\ \vec{\mu}_2 \end{bmatrix}

\Sigma=\begin{bmatrix} \Sigma_{\vec{x}_1\vec{x}_1} & \Sigma_{\vec{x}_1\vec{x}_2}\\ \Sigma_{\vec{x}_2\vec{x}_1} & \Sigma_{\vec{x}_2\vec{x}_2} \end{bmatrix}=\begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}=\begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{12}^T & \Sigma_{22} \end{bmatrix}

Because \Sigma is symmetric, \Sigma = \Sigma^T and \Sigma_{21} = \Sigma_{12}^T
In other words, there is this joint distribution
\begin{bmatrix} \vec{x}_1 \\ \vec{x}_2 \end{bmatrix} \sim \mathcal{N}{\left( \begin{bmatrix} \vec{\mu}_1 \\ \vec{\mu}_2 \end{bmatrix} , \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{12}^T & \Sigma_{22} \end{bmatrix} \right)}
For that case, the following properties hold:
  • Normalization: The density integrates to 1
  • Marginalization: The marginal distributions of x_1 and x_2 are Gaussian
\begin{align} \vec{x}_1 \sim \mathcal N (\vec{\mu}_1, \Sigma_{11})\\ \vec{x}_2 \sim \mathcal N (\vec{\mu}_2, \Sigma_{22}) \end{align}

  • Conditioning: The conditional distribution of \vec{x}_1 given \vec{x}_2 is also Gaussian, with
\begin{align} \vec{\mu}_{1|2} = \vec{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\vec{x}_2 - \vec{\mu}_2)\\ \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \end{align}

The derivation for this is at http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node7.html
(the \Sigma_{i|j} part of the theorem there isn't right, though the rest is ok).
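As a quick numerical sanity check (my own made-up example, not from the notes), here is a small numpy sketch that applies the conditioning formulas to a 3-dimensional Gaussian partitioned into p = 1 and q = 2.
import numpy as np

# Made-up partitioned Gaussian: x in R^3 split into x1 in R^1 and x2 in R^2
mu_1 = np.array([0.0])
mu_2 = np.array([1.0, -1.0])
Sigma_11 = np.array([[2.0]])
Sigma_12 = np.array([[0.8, 0.3]])
Sigma_22 = np.array([[1.0, 0.2],
                     [0.2, 1.5]])

x_2 = np.array([1.5, -0.5])  # observed value of x2

# Conditioning formulas from the theorem
Sigma_22_inv = np.linalg.inv(Sigma_22)
mu_1_given_2 = mu_1 + Sigma_12 @ Sigma_22_inv @ (x_2 - mu_2)
Sigma_1_given_2 = Sigma_11 - Sigma_12 @ Sigma_22_inv @ Sigma_12.T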

Kernel Function

Given two sets of inputs X and X', the kernel (covariance) matrix is built by evaluating a kernel function k on every pair of points:
K_{XX'} = K(X,X') = \begin{pmatrix} k(x_1, x_1') & \cdots & k(x_1, x_m') \\ \vdots & \ddots & \vdots \\ k(x_m, x_1') & \cdots & k(x_m, x_m') \end{pmatrix}

An example is the squared exponential kernel, a.k.a. the Gaussian kernel, a.k.a. the RBF kernel:
k_{\textrm{SE}}(x, x') = \sigma^2\exp\left(-\frac{(x - x')^2}{2\ell^2}\right)
In [1]:
import numpy as np

def exponential_cov(x, y, params):
    # Squared exponential kernel; params = [sigma, lengthscale]
    return params[0]**2 * np.exp(-np.subtract.outer(x, y)**2 / (2*params[1]**2))

θ = [1, 1]
exponential_cov([1, 2], [1, 2], θ)
Out[1]:
array([[1.        , 0.60653066],
       [0.60653066, 1.        ]])

Joint Distribution

From here on, I'll drop the \vec{} notation for vectors.
The partitioning theorem is useful here because we assume that f and f_* come from partitioning one jointly Gaussian vector.
Consequently, f and f_* each have their own mean and covariance, just as in the theorem.
\begin{align} f \sim \mathcal N (\mu, \Sigma)\\ f_* \sim \mathcal N (\mu_*, \Sigma_{**}) \end{align}

However, \Sigma is written as K (for kernel) because it is computed with a kernel function.
\begin{align} f \sim \mathcal N (\mu, K)\\ f_* \sim \mathcal N (\mu_{*}, K_{**}) \end{align}

\begin{pmatrix} f \\ f_{*} \end{pmatrix} \sim \mathcal{N}{\left( \begin{pmatrix} \mu \\ \mu_{*} \end{pmatrix} , \begin{pmatrix} K & K_{*}\\ K_{*}^T & K_{**}\\ \end{pmatrix} \right)}

Where
f, \mu \in \mathbb R^m
f_*, \mu_* \in \mathbb R^{m_*}
K = K(X,X) is m\times m
K_* = K(X,X_*) is m\times m_*
K_{**} = K(X_*,X_*) is m_*\times m_*
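To see what this block structure looks like in code, here is a small sketch that assembles the joint covariance of [f, f_*] from K, K_*, and K_{**} using the exponential_cov function defined above; the training and test inputs are made up for illustration.
import numpy as np

X_train = np.array([1.0, 2.0, 3.0])  # m = 3 training inputs (made up)
X_test = np.array([1.5, 2.5])        # m_* = 2 test inputs (made up)
kernel_params = [1, 1]

K = exponential_cov(X_train, X_train, kernel_params)   # m x m
K_s = exponential_cov(X_train, X_test, kernel_params)  # m x m_*
K_ss = exponential_cov(X_test, X_test, kernel_params)  # m_* x m_*

# Joint covariance of [f, f_*]: an (m + m_*) x (m + m_*) matrix
joint_cov = np.block([[K, K_s],
                      [K_s.T, K_ss]])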

Posterior

By the conditioning property of the theorem (here with x_1 = f_*, x_2 = f, so that \Sigma_{11} = K_{**}, \Sigma_{12} = K_{*}^T, and \Sigma_{22} = K),
\begin{align} \vec{\mu}_{1|2} = \vec{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\vec{x}_2 - \vec{\mu}_2)\\ \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \end{align}

The posterior p(f_* | f) = p(f_* | f, X_*, X) is
p(f_* | f, X_*, X) = \mathcal N(\mu^*, \Sigma^*)

\begin{align} \mu^* = \mu_* +K_{*}^T K^{-1}(f - \mu)\\ \Sigma^* = K_{**} - K_{*}^T K^{-1} K_{*} \end{align}

Prior

As with every Bayesian method, a prior probability p(f_*) has to be defined.

Gaussian Process (GP)

It is denoted using this notation
f(x) \sim \mathcal{GP}(m(x), k(x,x'))

Or like this, since the mean and covariance functions can take parameters:
f(x) \sim \mathcal{GP}(m(x|\theta_m), k(x,x'|\theta_c))

A GP defines a distribution over all possible functions f(x); since there are infinitely many such functions, in practice we only ever work with the function values at a finite set of input points, which follow a multivariate Gaussian.

Sample functions from some priors

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [3]:
def plot_gp(ax, kernel, plot_xs, params, n_samples=10, xlab=False, ylab=False):
    # Draw n_samples functions from a zero-mean GP prior with the given kernel
    cov = kernel(plot_xs, plot_xs, params)
    sampled_funcs = np.random.multivariate_normal(np.zeros(len(plot_xs)), cov, size=n_samples)
    ax.plot(plot_xs, sampled_funcs.T)
    ax.set_title(r'$\sigma = {},\ \ell = {}$'.format(params[0], params[1]), fontsize=22)
    if xlab:
        ax.set_xlabel(r'$x$', fontsize=20)
    if ylab:
        ax.set_ylabel(r'$f(x)$', fontsize=20)

fig = plt.figure(figsize=(20, 8), facecolor='white')
axes = [fig.add_subplot(2, 3, i + 1, frameon=False) for i in range(6)]
for ax in axes:
    ax.set_xticks([])
    ax.set_yticks([])

plot_xs = np.linspace(-5, 5, 300)
lengthscales = [0.05, 0.1, 0.5, 1, 3, 16]
for ax, ell in zip(axes, lengthscales):
    plot_gp(ax, exponential_cov, plot_xs, [1, ell], 3, ylab=True)

Prediction

The idea of prediction with Gaussian Processes boils down to
"Given my old x, and the y values for those x, what do I expect y to be for a new x?".
We are then given the following information:
x = [1, 2, 3, 5, 7, 10] and y = [-0.3, 0.3, 0.5, -1.2, 0.7, -0.5]
Now, what is your best guess for y when x = 6?
My best guess would be y = 0
To guess that with a GP, we have to implement
p(f_* | f, X_*, X) = \mathcal N(\mu^*, \Sigma^*)

\begin{align} \mu^* = \mu_* +K_{*}^T K^{-1}(f - \mu)\\ \Sigma^* = K_{**} - K_{*}^T K^{-1} K_{*} \end{align}

where
K = K(X,X) is m\times m
K_* = K(X,X_*) is m\times m_*
K_{**} = K(X_*,X_*) is m_*\times m_*
And I'll use the prior
f(x) \sim \mathcal{GP}(m(x) = 0, k(x,x'))

Meaning that
\begin{align} f \sim \mathcal N (\mu = \mathbf0_{m}, K)\\ f_* \sim \mathcal N (\mu_{*} = \mathbf0_{m_*}, K_{**}) \end{align}
Because a zero-mean prior is the most common choice, the posterior usually simplifies to
p(f_* | f, X_*, X) = \mathcal N(\mu^*, \Sigma^*)

\begin{align} \mu^* = K_{*}^T K^{-1}f\\ \Sigma^* = K_{**} - K_{*}^T K^{-1} K_{*} \end{align}

or
p(y_* | y, X_*, X) = \mathcal N(K_{*}^T K^{-1}y, K_{**} - K_{*}^T K^{-1} K_{*})
In [4]:
x = np.array([1, 2, 3, 5, 7, 10])
y = [-0.3, 0.3, 0.5, -1.2, 0.7, -0.5]
In [5]:
def posterior(x_new, x, y, kernel, params):
    # K = K(X, X), K_* = K(X, X_*), K_** = K(X_*, X_*)
    k = kernel(x, x, params)
    k_s = kernel(x, x_new, params)
    k_ss = kernel(x_new, x_new, params)
    k_inv = np.linalg.pinv(k)
    # mu* = K_*^T K^{-1} y  and  Sigma* = K_** - K_*^T K^{-1} K_*
    post_mean = k_s.T.dot(k_inv).dot(y)
    post_variance = k_ss - k_s.T.dot(k_inv).dot(k_s)
    return post_mean, post_variance

Fit and predict

In scikit-learn, a train or "fit" step for this model will essentially only store the values of x and y in memory because, unlike other types of learning algorithms, the method needs to consider the whole training set again whenever it evaluates test data. As a consequence, "predicting" is the costly step.
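To illustrate that point, here is a minimal sketch of a hypothetical SimpleGP class (not scikit-learn's actual implementation) where fit only memorizes the data and predict does all the work through the posterior function above.
class SimpleGP:
    def __init__(self, kernel, params):
        self.kernel = kernel
        self.params = params

    def fit(self, x, y):
        # "Training" just stores the dataset in memory
        self.x_train = np.asarray(x)
        self.y_train = np.asarray(y)
        return self

    def predict(self, x_new):
        # All the expensive work (kernel matrices, matrix inversion) happens here
        return posterior(x_new, self.x_train, self.y_train, self.kernel, self.params)

# e.g. SimpleGP(exponential_cov, [1, 1]).fit(x, y).predict([6])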
Predicting a value
In [6]:
params = [1, 1]

x_new = [6]
mean, variance = posterior(x_new, x, y, exponential_cov, params)
mean, variance
Out[6]:
(array([-0.30152242]), array([[0.34578349]]))
The posterior mean is the GP's best guess for y_new, and the posterior variance quantifies how uncertain that guess is.
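To turn that into an interval, one common convention (my addition, not part of the original post) is to take the square root of the posterior variance and report roughly ±2 standard deviations as an approximate 95% credible interval.
# Approximate 95% credible interval for y at x = 6, using the posterior above
std = np.sqrt(variance.diagonal())
lower, upper = mean - 2 * std, mean + 2 * std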

Plotting a confidence region around an expected function

In [7]:
x_pts = np.arange(x.min()-5, x.max()+5, step=0.01)
mean, variance = posterior(x_pts, x, y, exponential_cov, params)
y_pts = np.random.multivariate_normal(mean, variance)
In [8]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(x_pts, y_pts, yerr=variance.diagonal(), capsize=0)
ax.scatter(x, y, c='red')
Out[8]:
<matplotlib.collections.PathCollection at 0x108346dd8>

Scikit-learn

In [9]:
np.array([x]).T
Out[9]:
array([[ 1],
       [ 2],
       [ 3],
       [ 5],
       [ 7],
       [10]])
In [10]:
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import RBF

kernel = RBF()
gp = gaussian_process.GaussianProcessRegressor(kernel=kernel)
gp.fit(np.array([x]).T, y)
Out[10]:
GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
             kernel=RBF(length_scale=1), n_restarts_optimizer=0,
             normalize_y=False, optimizer='fmin_l_bfgs_b',
             random_state=None)
In [11]:
mean, variance = gp.predict(np.array([x_pts]).T, return_cov=True)
In [12]:
y_pts = np.random.multivariate_normal(mean, variance)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(x_pts, y_pts, yerr=variance.diagonal(), capsize=0)
ax.scatter(x, y, c='red')
Out[12]:
<matplotlib.collections.PathCollection at 0x10d02ac18>