Sampling: Metropolis Hastings


  • Gibbs Sampling converges slowly and generates samples too correlated
  • MH produces not so correlated samples
  • MH applies rejection and the sampling idea to Markov Chains

Metropolis-Hastings Algorithm - Part 1

  1. Initialize the initial state by picking a random $x$
  2. Sample $x'$ from a "proposal distribution" $Q(x \to x')$. For example, it can be $Q(x \to x') = N(x, 1)$.
  3. Accept or reject that $x'$?
To see if we'll reject that point, we'll define a critic $A$, and we'll accept it with probability $A(x \to x')$.
$T (x \to x') = Q(x \to x') A(x \to x')$ for all $x \neq x'$.
$Q$ that we will try to fix by using a "critic" $A$, and that will give us a Markov chain $T$ that we'll need in order to find the stationary distribution $\pi$.

Derivating the critic

We first start with our end goal: creating a Markov Chain where the steady state distribution is equal to our target distribution.
This is the detailed balance or reversibility condition. $$ \pi(x) T(x \to x') = \pi(x') T(x' \to x) $$
Replacing $T$
$$ \begin{align} \pi(x) Q(x \to x') A(x \to x') &= \pi(x') Q(x' \to x) A(x' \to x) \\ \frac{A(x \to x')}{A(x' \to x)} &= \frac{\pi(x') Q(x' \to x)}{\pi(x) Q(x \to x')} \end{align} $$
This can be summarized in the following expression
$$ \begin{equation*} A(x \to x') = \begin{cases} \frac{\pi(x') Q(x' \to x)}{\pi(x) Q(x \to x')} &\text{if $\pi(x') Q(x' \to x) < \pi(x) Q(x \to x')$}\\ 1 &\text{otherwise} \end{cases} \end{equation*} $$
or equivalently $$ A(x \to x') = \min\left\{1, \frac{\pi(x') Q(x' \to x)}{\pi(x) Q(x \to x')}\right\} $$
I know it isn't too clear, and it probably doesn't need to though I may update this when I finally understand this.

Metropolis-Hastings Algorithm - Part 2

Now we can complete the algorithm
Initialize the initial state by picking a random $x$
For k = 1,2, ...
  1. Sample $x'$ from a "proposal distribution" Q$(x^k \to x')$
  2. Accept proposal $x'$ with probability according to a "critic" $A(x^k \to x')$. If it's accepted, then transition to $x'$
  3. Otherwise stay at state $x^k$ $$ x^{k+1} = x^{k} $$

Symmetric distribution $Q$

Note that the acceptance condition simplifies when the proposal distribution is symmetric, i.e. $Q(x' \to x) = Q(x \to x')$. In this case, the algorithm is also known as the Metropolis algorithm with the intuitive acceptance probability $\min\big\{1, \frac{\pi(x')}{\pi(x)}\big\}$.
The algorithm also works if the normalization constant of $\pi(x)$ is unavailable as it cancels in the acceptance condition.

Example 1

This generates samples from a digamma function, like in rejection sampling
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

# Target = double gamma distribution
dg = stats.dgamma(a=1)

# Generate samples for PDF
x = np.linspace(dg.ppf(0.001), dg.ppf(0.999), 1000)
dg_samples = dg.pdf(x)

# Plot
df = pd.DataFrame({'Target': dg_samples}, index=x)
ax = df.plot(style=['-'], color=['blue'], 
             figsize=(8,6), linewidth=2.0)

Implementing MH

Proposal distribution: It will be a normal distribution $N(x, .5)$
$$ Q(x \to x') = N(x, 2) = f_{N(x, .5)} (x) $$
We also need to input $\pi(x)$ that might be proportional to our underlying distribution $p(\mathbf x)$, the critic or accepting distribution will look like this:
$$ \begin{align} A(x \rightarrow x') = min\Big\{1, \frac{\pi(x')f_{N(x',.5)}(x)}{\pi(x)f_{N(x,.5)}(x')}\Big\} = min\Big\{1, \frac{\pi(x')}{\pi(x)}\Big\} \end{align} $$
And we choose $x'$ according to a conditional $A(x \rightarrow x') > Threshold$
Intuitively, that threshold can be simply $0.5$, but $U(0, 1)$ seems a better value for it.
In [2]:
def mh_sampler(x0, pi, N=100):
    x_curr = x0
    samples = []
    samples.append(x_curr)
    
    for i in range(N):
        x_next = np.random.normal(x_curr, 0.5)
        if min(1, pi(x_next) / pi(x_curr)) > np.random.uniform(0, 1):
            x_curr = x_next
            samples.append(x_next)
    return np.array(samples)

x0 = np.random.uniform(0, 1)  # any value between 0 and 1

# Proportional to the double gamma distribution
# (because it's the one we're trying to approximate samples from)
pi = lambda x: dg.pdf(x) * np.pi

samples = mh_sampler(x0, pi, 20000)

# Plot Histogram vs. Target PDF
df['Target'].plot(color='blue', style='--', figsize=(8,6), linewidth=2.0)
pd.Series(samples).hist(bins=300, normed=True, color='green', 
                        alpha=0.3, linewidth=0.0)
plt.legend(['Target PDF', 'MH Sampling'])
plt.show()
In [3]:
from pandas.plotting import autocorrelation_plot
# Plot autocorrelation to ensure samples are independent
autocorrelation_plot(pd.Series(samples))
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x103d93940>
As we can see visually, the samples from our MH sampler are a good approximation of the double gamma distribution. Additionally, looking at the autocorrelation plot, we can see that it's quite small across our entire sample, indicating that they are relatively independent.
If $N$ isn't big enough, the samples can be more correlated, which is what we don't want. This correlation also depends on $Q$.

Burn-In for choosing the initial value

Since we randomly pick an initial value for $x$, it's quite possible that it is in a region where $p(x)$ is quite small (think of the tail ends of our double gamma distribution). If it starts here, it might spend a disproportionate amount of time traversing through $x$ values with low density.
So the solution to this is to "burn-in" the sampler by generating a bunch of samples and throwing them away first.

Burn-In for generating less correlated samples

Since by definition of our transition function $T(x \to x')$, drawing $x'$ is dependent on the current state $x$. Thus, we lose one vital property of our samples: independence.
To correct this, we can draw $N_b$ samples and only record the last one drawn. Assuming $N_b$ is large enough, the samples should be relatively independent.

Example 2

Here's a complete MH implementation
It uses logarithms in the part of the conditional
$$ \log A(x \to x') = \min\Big\{1, \log(\pi(x') Q(x' \to x)) - \log(\pi(x) Q(x \to x'))\Big\} $$
Proposal distribution: It will be a multivariate normal
$$ Q(x \to x') = N\Big(x,\begin{bmatrix} 1 & 0\\ 0 & 1 \end{bmatrix}\Big) $$
We also need to input $\pi(x)$
$$ \pi(x) = N\Big(x,\begin{bmatrix} 1 & 0\\ 0 & 1 \end{bmatrix}\Big) $$
And the critic or accepting distribution will be calculated in the same way. And yes, those $Q$ are also symmetric in this case.
And it will try to sample from the same 2d normal distribution as in the Gibbs Sampling notes
In [4]:
p2d = stats.multivariate_normal(mean=[0,0], cov=[[1, 0.6],[0.6, 1]])

X, Y = np.mgrid[-3:3:0.05, -3:3:0.05]
XY = np.empty(X.shape + (2,))
XY[:,:,0] = X; XY[:,:,1] = Y
plt.contour(X, Y, p2d.pdf(XY));
In [5]:
class Sampling (object):
    """
    Abstract base class for all sampling methods.
    
    Subclasses need to implement self.sample()
    """
    def sample(self):
        pass
    
    def __str__(self):
        """
        Default is to show class
        """
        return str(self.__class__)
    
class InversionSampling (Sampling):
    def __init__(self, h_inv):
        self.h_inv = h_inv
        
    def sample(self):
        return self.h_inv(np.random.uniform())
In [6]:
class Proposer (object):
    """
    Wraps two functions needed by a proposer Q, i.e.
      Draw a new value y ~ Q.propose(x)
      Compute log transition probability Q.log_trans_prob(x,y) 
    """
    def __init__(self, propose, log_trans_prob):
        self.propose = propose
        self.log_trans_prob = log_trans_prob
    
    def propose(self, x):
        return self.propose(x)
    
    def log_trans_prob(self, x, x_prime):
        return self.log_trans_prob(x, x_prime)

class MetropolisHastings (Sampling):
    def __init__(self, log_p, q, x):
        """
        q is assumed to be a proposer and log_p computes log p(x)
        """
        self.x = x # Current sample
        self.log_p = log_p
        self.q = q
        self.samples = 0
        self.accepted = 0
        
    def __str__ (self):
        return "Metropolis Hastings: Accepted %d out of %d samples" % (self.accepted, self.samples)
        
    def sample (self):
        self.samples += 1
        # Propose new candidate
        x_prime = self.q.propose(self.x)
        A = self.log_p(x_prime) + self.q.log_trans_prob(x_prime, self.x) \
            - self.log_p(self.x) - self.q.log_trans_prob(self.x, x_prime)

        if A > np.log(np.random.uniform()):
            self.accepted += 1
            self.x = x_prime
        return self.x
In [7]:
x0 = [-3, 3]

sampling = MetropolisHastings(log_p=p2d.logpdf, \
                              q=Proposer(lambda x: multivariate_normal(mean=x, cov=[[1,0.5],[0.5,1]]).rvs(),
                                         lambda x,x_prime: multivariate_normal(mean=x, cov=[[1,0.5],[0.5,1]]).logpdf(x)), # Proposal is actually symmetric
                              x=x0)

samples = [x0]
N = 200
for _ in range(N):
    samples = np.vstack([samples, sampling.sample()])

plt.plot(samples[:,0], samples[:,1], 'ro-')
plt.contour(X, Y, p2d.pdf(XY))
Out[7]:
<matplotlib.contour.QuadContourSet at 0x11891eb38>
In [8]:
from pandas.plotting import autocorrelation_plot
# Plot autocorrelation to ensure samples are independent
autocorrelation_plot(samples)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x11891e588>

MH Algorithm summary

Here I'll change the notation. This expression $$ \frac{\pi(x') Q(x' \to x)}{\pi(x) Q(x \to x')} $$
can be written as $$ \frac{\pi(x') q(x|x')}{\pi(x) q(x'|x)} $$
And being more practical, we will mostly want to approximate parameters $\theta$
$$ \frac{\pi(\theta') q(\theta|\theta')}{\pi(\theta) q(\theta'|\theta)} $$
$p$ can be seen as a prior and $q$ as a likelihood for each case, and maybe like this, it is easier to understand the idea behind "the critic." We have this ratio
$$ \text{Ratio} = \frac{\text{Posterior probability of $\theta'$}}{\text{Posterior probability of $\theta$}} $$

(1) If the NUMERATOR is greater than the DENOMINATOR, then we will accept the new $\theta'$

(2) Otherwise, we will not necessarily discard the new $\theta'$, but we can threat ratios less than $1$ as an acceptance probability $\alpha$, then use a random number from a uniform distribution to see if we accept it or not.

This will be the acceptance probability that considers both cases implicitly
$$\alpha(\theta',\theta) = \min\left\{1, \frac{\pi(\theta') q(\theta|\theta')}{\pi(\theta) q(\theta'|\theta)}\right\}$$
That is because if we do this $u \sim \cal U_{(0,1)}$
$u < \alpha(\theta',\theta)$ will always be true in case (1)
and will be true also when $\text{Ratio} < 1$ and what happens in case(2) happens
So, the full algorithm should look as follows

Inference

The final goal of using Monte Carlo is not only to approximate probability distributions $p(\theta)$ but also to make inferences with them. It was because we could not compute this full inference.
$$ p(\theta \mid y) = \frac{p(y \mid \theta)p(\theta)}{p(y)} $$
The thing is that the stationary distribution $\pi$ is the distribution we actually find in the end, which is somehow $p(\theta|y)$. Maybe because in real datasets we actually define $\pi(\theta)$ according to some data $y$, and not as in the second example
$$ \pi(\theta) = N\Big(\theta,\begin{bmatrix} 1 & 0\\ 0 & 1 \end{bmatrix}\Big) $$
We are simulating a Markov chain $\{\theta_k\}^N_{k=1}$ and
$$ \pi(\theta) = p(\theta \mid y) = \frac{p(y \mid \theta)p(\theta)}{p(y)} $$