Sampling: Gibbs Sampling Explained

I'll be using examples from
https://www.youtube.com/watch?v=a_08GKWHFWo
https://wiseodd.github.io/techblog/2015/10/09/gibbs-sampling/

Monte Carlo

Origin: Manhattan project (the atomic bomb)
Monte Carlo method is a general term for a broad class of algorithms that use random sampling to compute some numerical result. All this comes from the idea of simulating stuff not mentally but automatically.

These methods are often used when it is difficult or even impossible to compute things directly. Examples of applications involve optimization, numerical integration, and sampling from a probability distribution.

Markov Chain Monte Carlo (MCMC)

The previous sampling methods (rejection sampling) produced independent samples from the target distribution. Markov Chain Monte Carlo (MCMC) methods instead produce a sequence of dependent samples which nevertheless has a desired marginal distribution.
A sequence of random variables $X_1, X_2, \ldots$ is called a Markov chain if their joint density can be factorized as follows:
$$ p(\mathbf x) = p(x_1, x_2, \ldots) = p(x_1) \prod_{i=1}^{\infty} p(x_{i+1} | x_1, \ldots, x_{i}) = p(x_1) \prod_{i=1}^{\infty} p(x_{i+1} | x_i) $$
Thus, each variable is independent of its history, given its direct predecessor. A Markov chain is called homogeneous if its transition density $p(x_{i+1}|x_i)$ is the same for all $i \in \mathbb{N}$.

Basic algorithm

These methods are simply a class of algorithms that use Markov Chains to sample from a particular probability distribution (the Monte Carlo part). They work by creating a Markov Chain where the limiting distribution (or stationary distribution) is simply the distribution we want to sample.
The basic algorithm for sampling using an MCMC method operates as follows:
  1. Start at an arbitrary point $x$.
  2. Jump to point $x'$ with a certain transition probability (this may mean staying in the same state).
  3. Go to step 2 until we have transitioned $N$ times.
  4. Record current state $x'$, go to step 2.

Detailed Balance Equation

MCMC constructs a Markov chain with a desired stationary distribution $p(\mathbf x)$. The stationary distribution $p^*(\mathbf x)$ of a homogeneous Markov chain with transition density $p(\mathbf x' | \mathbf x)$ has the property
$$ p^*(\mathbf x') = \int p(\mathbf x' | \mathbf x) p^*(\mathbf x) d\mathbf x, $$
which is just another way to write what's in the "Markov Chains" notes $p^* = \pi$
$$ \begin{align} \pi(x') &= \int T(x \to x') \pi(x) dx\\ \pi(x') &= \sum_x T(x \to x') \pi(x) \end{align} $$
i.e. the distribution is unchanged by the action of the Markov chain.
Because that last equation is hard to check, there's a better way to verify that $\pi$ is a stationary distribution.
A sufficient condition to ensure that $p(x)$ is invariant is that the transition probability satisfies detailed balance.
$$ \pi(x) T(x \to x') = \pi(x') T(x' \to x) $$
Proof:
$$ \begin{align} \sum_x \pi(x) T(x \to x') &= \sum_x \pi(x') T(x' \to x) \\ &= \pi(x') \sum_x T(x' \to x) \\ &= \pi(x') \end{align} $$
A Markov chain which satisfies detailed balance is reversible.
Ideally, we want that the chain converges to its target distribution $\pi(x)$, independent of where it started. It has to have a stationary distribution.

Gibbs Sampling

Gibbs Sampling is an MCMC method to draw samples from a potentially really, really complicated, high dimensional distribution.
It is a technique to sample from a multi-variate distribution, e.g., $p(\mathbf{x}) = p(x_1, \ldots, x_D)$ where each sample is a $D$-dimensional vector $\mathbf{x} \in \mathbb{R}^D$.
Instead of sampling directly from the full joint distribution, Gibbs sampling considers each coordinate in turn and samples a new value from its conditional distribution, e.g.,
$$ p(x_d | x_1,\ldots,x_{d-1},x_{d+1},\ldots,X_D) = p(x_d | \mathbf{x}_{-d}) $$
The multi-variate extension then draws on the observation that MCMC transitions can be combined, either in random or in sequential order, if each individual transition leaves the target density invariant.

Algorithm

Gibbs sampling reduces multidimensional sampling to a sequence of 1d samplings
Here's the algorithm in case of a bivariate distribution
Suppose $x_1, x_2 \sim p(x_1, x_2)$ and we can sample from $p(x_1 \mid x_2)$ and $p(x_2 \mid x_1)$, so these two conditional distributions have to be defined at the beginning
  1. Start by defining $p(x_1 \mid x_2)$ and $p(x_2 \mid x_1)$
  2. Set an initial value $(x_1^0, x_2^0)$
  3. Find the next sample $(x_1^1, x_2^1)$. It will be $$ x_1^1 \sim p(x_1 \mid x_2=x_2^0)\\ x_2^1 \sim p(x_2 \mid x_1=x_1^1) $$ Therefore, the algorithm will consist of the loop $$ \begin{align} \text{for $k=0,1,\ldots$}\\ x_1^{k+1} &\sim p(x_1 \mid x_2=x_2^k)\\ x_2^{k+1} &\sim p(x_2 \mid x_1=x_1^{k+1}) \end{align} $$
We will sample from this distribution without doing it directly

Example

In [1]:
%config IPCompleter.greedy=True
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

p2d = stats.multivariate_normal(mean=[0,0], cov=[[1, 0.6],[0.6, 1]])

X, Y = np.mgrid[-3:3:0.05, -3:3:0.05]
XY = np.empty(X.shape + (2,))
XY[:,:,0] = X; XY[:,:,1] = Y
plt.contour(X, Y, p2d.pdf(XY));
In a bivariate normal example, so that we can compute the conditionals, we can suppose that
$$ \begin{align} \mathbf x = [x_1, x_2] \sim N\Big(0,\begin{bmatrix} 1 & \rho\\ \rho & 1 \end{bmatrix}\Big) \end{align} $$
$\rho$ will be like a correlation between $x1$ and $x2$. These two conditionals were found by applying a theorem that's demonstrated in http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node7.html
$$ x_1 \mid x_2 \sim N(\rho x_2, 1-\rho^2)\\ x_2 \mid x_1 \sim N(\rho x_1, 1-\rho^2) $$
In [2]:
def gibbs_sampling(x0, const, N=100):
    x1, x2 = x0
    samples = []
    samples.append(x0)
    
    for _ in range(N):
        x1 = np.random.normal(const*x2, [1-const*const])
        x2 = np.random.normal(const*x1, [1-const*const])
        samples.append([x1, x2])
    
    return np.array(samples)

x0 = [-3, 3]
const = 0.5
N = 200
samples = gibbs_sampling(x0, const, N)
plt.plot(samples[:, 0], samples[:, 1], 'ro-')

# And this comes from the previous plot
plt.contour(X, Y, p2d.pdf(XY));
plt.show()
In [3]:
from pandas.plotting import autocorrelation_plot
# Plot autocorrelation to ensure samples are independent
autocorrelation_plot(samples)
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x10fd7f588>
This might seem confusing in this example, but the thing is that we're not really sampling using the original distribution but using conditionals from bivariate distributions.
In the end, the stationary distribution will be
$$ \begin{align} \pi(\mathbf x) = N\Big(0,\begin{bmatrix} 1 & \rho\\ \rho & 1 \end{bmatrix} \end{align}\Big) $$