Sampling: Markov Chains


Before we get into the meat of the subject, let's break down the term Markov Chain Monte Carlo (MCMC) into its basic components: Monte Carlo methods and Markov chains.

Markov Chains

A Markov Chain is a random process that undergoes transitions from one state to another in a state space. It is best described as a finite state machine where the transitions represent probabilities of going from one state to another.

A Markov model describes a sequence of random variables $X_0, X_1, \ldots, X_n, \ldots$, where each $X_n$ is drawn from a distribution that depends only on the immediately preceding variable $X_{n-1}$ (the Markov property):
$$ p(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = p(X_{n+1} = j \mid X_{n} = i)$$
We can see this with a simple two-state weather model, sunny (S) and rainy (R). The probabilities of transitioning from one state to another are, for example, $$ T(S\to S) = 0.9\\ T(R\to S) = 0.5 $$ so that $T(S\to R) = 0.1$ and $T(R\to R) = 0.5$.
Now, if we start in state S, the initial probabilities are
$P(X_{0}=S) = 1$ and $P(X_{0}=R) = 0$
To find the probability of being in state S at the next time step, we compute $P(X_{1}=S)$:
$$ \begin{align} P(X_{1}=S) &= P(X_{1}=S \mid X_{0}=S)\, P(X_{0}=S) + P(X_{1}=S \mid X_{0}=R)\, P(X_{0}=R)\\ &= 0.9 \cdot 1 + 0.5 \cdot 0 = 0.9 \end{align} $$
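The same one-step update is easy to verify in plain Python; a minimal sketch (the variable names are just for illustration):

# P(X1 = S) = P(S|S) * P(X0 = S) + P(S|R) * P(X0 = R)
p_S_given_S, p_S_given_R = 0.9, 0.5   # transition probabilities into S
p0_S, p0_R = 1.0, 0.0                 # initial distribution: we start in S
p1_S = p_S_given_S * p0_S + p_S_given_R * p0_R
print(p1_S)                           # 0.9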

Autocomplete

This is a simple application that will predict the next word of a user's query.
Naturally, this will involve dealing with probabilities. Here, a Markov model is our stochastic tool of choice.

Markov model

Because the probability of a future state depends only on the present state (that's another way to state the Markov property), each state in our Markov chain will represent a unique word, and we only need to collect bigrams.
In [27]:
class MarkovChain:

    def __init__(self):
        self.memory = {}

    def _learn_key(self, key, value):
        # map each word to the list of words observed immediately after it
        if key not in self.memory:
            self.memory[key] = []
        self.memory[key].append(value)

    def learn(self, text):
        tokens = text.split(" ")
        bigrams = [(tokens[i], tokens[i + 1]) for i in range(0, len(tokens) - 1)]
        for bigram in bigrams:
            self._learn_key(bigram[0], bigram[1])

m = MarkovChain()
m.learn('I am Sam. Sam I am. I do not like green eggs and ham.')
print(m.memory)
{'I': ['am', 'am.', 'do'], 'am': ['Sam.'], 'Sam.': ['Sam'], 'Sam': ['I'], 'am.': ['I'], 'do': ['not'], 'not': ['like'], 'like': ['green'], 'green': ['eggs'], 'eggs': ['and'], 'and': ['ham.']}
There might be better implementations for this memory, but that's the idea.
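For instance, collections.defaultdict removes the explicit membership check. A minimal alternative sketch (MarkovChainDD is just a hypothetical name, not part of the original code):

from collections import defaultdict

class MarkovChainDD:

    def __init__(self):
        # defaultdict(list) creates the empty list the first time a key is seen
        self.memory = defaultdict(list)

    def _learn_key(self, key, value):
        self.memory[key].append(value)

    def learn(self, text):
        tokens = text.split(" ")
        for first, second in zip(tokens, tokens[1:]):
            self._learn_key(first, second)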
If we were to use this data to predict the word that follows "I", we have three choices, and each of them has the same probability (1/3) of being chosen.
We can add additional transitions to our Chain by considering additional bigrams starting with "am", "am.", and "do". In each case, there is only one possible choice for the next state in our Markov Chain, given the bigrams we know from our input text. Each transition from one of these states, therefore, has a 1.0 probability.
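We can make those probabilities explicit by counting the stored successors. A small sketch (transition_probs is a hypothetical helper, not part of the class above):

from collections import Counter

def transition_probs(memory, state):
    # empirical transition probabilities out of `state`
    counts = Counter(memory[state])
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

print(transition_probs(m.memory, 'I'))    # {'am': 0.333..., 'am.': 0.333..., 'do': 0.333...}
print(transition_probs(m.memory, 'am'))   # {'Sam.': 1.0}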

Predicting the next word

Doing this is as simple as randomly choosing among them:
In [36]:
import random
random.sample(['am', 'am.', 'do'], 1)
Out[36]:
['am.']
However, the transition probabilities between states naturally become weighted as we learn more text.
For example, in the following sequence we learn a few sentences that share bigrams, and by the end we are twice as likely to choose "am" as the next word following "I" when randomly sampling from the possible next states.
In [58]:
m2 = MarkovChain()
m2.learn('I am Sam.')
print(m2.memory)

m2.learn('I am Kevin.')
print(m2.memory)

m2.learn('I do.')
print(m2.memory)
{'I': ['am'], 'am': ['Sam.']}
{'I': ['am', 'am'], 'am': ['Sam.', 'Kevin.']}
{'I': ['am', 'am', 'do.'], 'am': ['Sam.', 'Kevin.']}
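Counting the stored successors confirms the weighting; reusing the hypothetical transition_probs helper sketched earlier:

print(transition_probs(m2.memory, 'I'))    # {'am': 0.666..., 'do.': 0.333...} -- "am" is twice as likely
print(transition_probs(m2.memory, 'am'))   # {'Sam.': 0.5, 'Kevin.': 0.5}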
In [129]:
class MarkovChain:

    def __init__(self):
        self.memory = {}

    def _learn_key(self, key, value):
        if key not in self.memory:
            self.memory[key] = []

        self.memory[key].append(value)

    def learn(self, text):
        tokens = text.split(" ")
        bigrams = [(tokens[i], tokens[i + 1]) for i in range(0, len(tokens) - 1)]
        for bigram in bigrams:
            self._learn_key(bigram[0], bigram[1])

    def _next(self, current_state):
        next_possible = self.memory.get(current_state)

        # If there's no pre-defined initial state or if there's no next state, choose any
        if not next_possible:
            # list() is needed because random.sample expects a sequence, not a dict view
            next_possible = list(self.memory.keys())

        return random.sample(next_possible, 1)[0]

    def babble(self, num_states, state=''):
        if not num_states:
            return state

        next_word = self._next(state)
        
        output = state
        if state != '':
            output += ' '
        output += self.babble(num_states - 1, next_word)
        
        return output
In [156]:
m2 = MarkovChain()
m2.learn('I am Sam.')
m2.learn('I am Kevin.')
m2.learn('I do.')

m2.babble(100, state='I')
Out[156]:
'I am Kevin. am Kevin. am Sam. I am Kevin. am Sam. am Sam. am Kevin. am Sam. I do. am Sam. am Sam. I do. am Kevin. I am Sam. am Kevin. am Kevin. I am Sam. am Sam. am Sam. I do. am Sam. am Kevin. I am Kevin. am Kevin. I am Sam. am Sam. I do. I am Kevin. I am Kevin. I am Kevin. am Sam. am Sam. I am Kevin. am Sam. I am Kevin. am Kevin. am Kevin. am Kevin. am Kevin. am Sam. am Sam. I am Sam. I am Kevin. I am'
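One quirk visible in the output: sentence-final tokens such as "Kevin." were never stored as keys in memory, so _next falls back to sampling a random key, which is why fragments like "am Sam." appear without a preceding "I". A quick check:

print('Kevin.' in m2.memory)   # False -- no outgoing transitions were learned for it
print(list(m2.memory.keys()))  # ['I', 'am'] -- the fallback can only jump to one of these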

Markov Chains using matrix notation


A way to represent that finite state machine is via a transition matrix $P$, where entry $P_{ij}$ is the probability of moving from state $i$ to state $j$ (here the first index is sunny and the second is rainy):
$$ \begin{align} P = \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \\ \end{bmatrix} \end{align} $$
Suppose we start in a sunny state. We can represent that as the row vector:
$${\bf x}^{(0)} = \begin{bmatrix}1 & 0\end{bmatrix}$$
Then, the next state will be
$$ \begin{align} {\bf x}^{(1)} = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \\ \end{bmatrix} \end{align} $$
In this way, we can compute ${\bf x}^{(k)}$ for any time step $k$:
$$ \begin{align} {\bf x}^{(k)} = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \\ \end{bmatrix}^k \end{align} $$
In [1]:
import numpy as np
from numpy import linalg as LA

def simulate_markov(x_0, P, k):
    # print the distribution x^(i) = x^(0) P^i for i = 0, ..., k-1
    for i in range(k):
        P_k = LA.matrix_power(P, i)
        x_k = np.dot(x_0, P_k)
        print("x^(%d) = [%.4f %.4f]" % (i, x_k[0], x_k[1]))
        
P = np.array([[0.9, 0.1], [0.5, 0.5]])
istate = np.array([1, 0])
        
simulate_markov(istate, P, 15)
x^(0) = [1.0000 0.0000]
x^(1) = [0.9000 0.1000]
x^(2) = [0.8600 0.1400]
x^(3) = [0.8440 0.1560]
x^(4) = [0.8376 0.1624]
x^(5) = [0.8350 0.1650]
x^(6) = [0.8340 0.1660]
x^(7) = [0.8336 0.1664]
x^(8) = [0.8334 0.1666]
x^(9) = [0.8334 0.1666]
x^(10) = [0.8334 0.1666]
x^(11) = [0.8333 0.1667]
x^(12) = [0.8333 0.1667]
x^(13) = [0.8333 0.1667]
x^(14) = [0.8333 0.1667]
We can see an interesting phenomenon here: the probabilities of being sunny or rainy seem to converge as we take more steps in our state machine. You might think this has something to do with the initial state we're in, but in fact it doesn't. We get the same result if we initialize the state vector to something random:
In [2]:
r = np.random.rand()
simulate_markov(np.array([r, 1 - r]), P, 15)
x^(0) = [0.1619 0.8381]
x^(1) = [0.5648 0.4352]
x^(2) = [0.7259 0.2741]
x^(3) = [0.7904 0.2096]
x^(4) = [0.8161 0.1839]
x^(5) = [0.8265 0.1735]
x^(6) = [0.8306 0.1694]
x^(7) = [0.8322 0.1678]
x^(8) = [0.8329 0.1671]
x^(9) = [0.8332 0.1668]
x^(10) = [0.8333 0.1667]
x^(11) = [0.8333 0.1667]
x^(12) = [0.8333 0.1667]
x^(13) = [0.8333 0.1667]
x^(14) = [0.8333 0.1667]

Stationary Distribution

This steady-state distribution is called a stationary distribution, usually denoted by $\pi$.
A distribution $\pi$ is called stationary if
$$ \pi(x') = \sum_x T(x \to x') \pi(x) $$
This means that if you start from $\pi$, then after a transition you end up with the distribution $\pi$ again.
There is a theorem that tells us that if the transition probability between any two states is positive, $T(x \to x') > 0$, then there is a unique stationary distribution $\pi$, and we will converge to it from any starting point. So we will eventually get samples from that distribution $\pi$.
This steady-state vector $\pi$ can be found in several ways. The most straightforward is to take the limit as $n$ approaches infinity:
$$ {\bf \pi} = \lim_{n \rightarrow \infty} {\bf x}^{(n)} $$
Since ${\bf \pi}$ is by definition the steady state, multiplying it by $P$ must give the same vector back:
$$ \begin{align} {\bf \pi}P &= {\bf \pi} \\ {\bf \pi}(P - I) &= {\bf 0} \end{align} $$
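The condition ${\bf \pi}P = {\bf \pi}$ says that $\pi$ is a left eigenvector of $P$ with eigenvalue 1, which we can extract numerically. A minimal sketch using the same transition matrix as above (not part of the original notebook):

import numpy as np
from numpy import linalg as LA

P = np.array([[0.9, 0.1], [0.5, 0.5]])

# pi P = pi  <=>  pi is a (right) eigenvector of P transposed with eigenvalue 1
eigvals, eigvecs = LA.eig(P.T)
pi = eigvecs[:, np.isclose(eigvals, 1)].flatten().real
pi = pi / pi.sum()   # normalize so the entries sum to 1
print(pi)            # [0.8333... 0.1666...] -- matches the simulated limit above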
Not every Markov Chain has a stationary distribution or even a unique one. But we can guarantee these properties if we add two additional constraints to the Markov Chain:
  1. Irreducible: we must be able to reach any one state from any other state eventually (i.e., the expected number of steps is finite).
  2. Aperiodic: the system never returns to the same state with a fixed period (e.g., not returning to "sunny" deterministically every 5 steps); the sketch below shows what goes wrong when this fails.
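To see why aperiodicity matters, here is a small sketch (not from the original notebook) that feeds a periodic two-state chain into the simulate_markov function defined above; the state distribution keeps oscillating instead of converging to a single $\pi$:

# A deterministic swap every step: this chain is irreducible but has period 2
P_periodic = np.array([[0.0, 1.0], [1.0, 0.0]])
simulate_markov(np.array([1, 0]), P_periodic, 6)
# x^(0) = [1.0000 0.0000], x^(1) = [0.0000 1.0000], x^(2) = [1.0000 0.0000], ...
# the two distributions alternate forever, so there is no single limiting vector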