Bayesian Linear Regression Explained

Here I cover the Bayesian approach to linear regression and how Bayes' theorem is used to build predictors.

Notation

Training set of i.i.d. examples drawn from some unknown distribution. $$ D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m} $$
Testing set of i.i.d. examples drawn from the same distribution as $D$. $$ T = \{(x^{(i)}_*, y^{(i)}_*)\}_{i=1}^{m} $$
We have to fit the model $$ \begin{align} y^{(i)} &= w^T x^{(i)} + \epsilon^{(i)} \\ \epsilon^{(i)} &\sim \mathcal N (0, \sigma_n^2) \quad \text{i.i.d.} \end{align} $$
or in matrix form $$ \mathbf y = \mathbf X^T \mathbf w + \epsilon $$
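As a minimal sketch of what this model means generatively (the dimensions, weights, and noise level below are made up for illustration, not values used later in this notebook):

import numpy as np

np.random.seed(0)
m_demo, d_demo = 20, 2                                 # number of examples and features (assumed)
sigma_n = 0.1                                          # assumed noise standard deviation
w_true = np.array([1.0, -0.5])                         # assumed "true" weights

X_demo = np.random.uniform(-1, 1, (d_demo, m_demo))    # columns are the inputs x^(i)
eps = np.random.normal(0, sigma_n, m_demo)             # i.i.d. Gaussian noise
y_demo = X_demo.T.dot(w_true) + eps                    # y = X^T w + eps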

Example

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import math
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 200)

x_real = np.linspace(0, 1, 100, endpoint=True)
y_real = np.sin(2 * math.pi * x_real)
df_real = pd.DataFrame({'x':x_real, 'y':y_real})
plt.plot(x_real,y_real,'g-')

x = np.linspace(0, 1, 10, endpoint=True)
y = [np.sin(2 * math.pi * xi)+  np.random.uniform(-0.3,0.3) for xi in x]
df = pd.DataFrame({'x':x, 'y':y})
plt.plot(x,y,'bo')
Out[1]:
[<matplotlib.lines.Line2D at 0x110c4bef0>]

Deterministic LR

Parametric approach

In a deterministic approach, we model $y$ as a function of an independent variable $x$.
$$y = f(x) + \epsilon$$
For a model such as $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \epsilon$, the parameters $\theta_0$, $\theta_1$, and $\theta_2$ have to be found.
Three parameters need to be specified!
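As a quick aside (the data and the use of np.polyfit here are only for illustration; the notebook fits its own model below), finding those three parameters from noisy quadratic data can be done in closed form:

import numpy as np

np.random.seed(0)
x_demo = np.linspace(-1, 1, 30)
theta_true = (0.5, -1.0, 2.0)                          # assumed theta_0, theta_1, theta_2
y_demo = (theta_true[0] + theta_true[1] * x_demo + theta_true[2] * x_demo**2
          + np.random.normal(0, 0.1, x_demo.shape))

# np.polyfit returns the coefficients from the highest degree down
theta_2, theta_1, theta_0 = np.polyfit(x_demo, y_demo, deg=2)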

Linear Regression

To approximate this function I can use batch gradient descent for linear regression, which is a parametric approach.
The function below evaluates $w^T \phi(x^{(i)})$ with polynomial features, i.e. $\sum_{j=0}^{M} w_j (x^{(i)})^j$
In [2]:
def y_eval(x, w):
    # Evaluate the polynomial w_0 + w_1*x + ... + w_M*x^M at x
    return sum([w[i] * x**i for i in range(len(w))])
And this is the sum of squared errors (SSE)
In [3]:
def E(w,x,y):
    # Sum of squared errors: (1/2) * sum_n (y_hat(x_n) - y_n)^2
    return 1/2 * sum([(y_eval(x[n],w) - y[n])**2 for n in range(len(x))])
Fitting algorithm
In [4]:
def fitting(x, y, M, alpha, min_error, max_iterations):
    # M: polynomial degree, alpha: learning rate for gradient descent
    N = len(x)
    w = np.random.uniform(0,1,M+1)
    SSE = E(w,x,y)
    print(SSE)
    iterations = 0
    while SSE > min_error and iterations < max_iterations:
        for m in range(M+1):
            # Batch gradient descent
            w[m] = w[m] - alpha * sum([(y_eval(x[n],w)  - y[n]) * x[n] ** m for n in range(N)])
        SSE = E(w,x,y)
        iterations += 1
    print(SSE)
    return w

w = fitting(x, y, 3, 0.01, 0.1, 200000)
y_est = y_eval(x, w)
16.9776963689
0.138661612647
In [5]:
plt.plot(x_real, y_real, 'g-')
plt.plot(x, y, 'bo')
plt.plot(x, y_est,'r-')
Out[5]:
[<matplotlib.lines.Line2D at 0x110da1dd8>]

Evaluating test data

In [6]:
x_new = 0.33
y_eval(x_new, w)
Out[6]:
0.74143602558757626

Generative LR

Non-parametric approach

Bayesian linear regression, in contrast, takes a probabilistic approach: it finds a distribution over the parameters rather than a single point estimate.
There is no need to specify how many parameters!

Bayesian Linear Regression

From
$$ \mathbf y = \mathbf X^T \mathbf w + \epsilon $$
We can obtain the distribution of $y$ by adding the two independent random terms on the right-hand side, considering that $\mathbf X^T \mathbf w$ is the noise-free value of the function given $X$ and $w$, so it can be treated as $\mathcal N (X^T w, 0)$ with zero variance.
$$ p(y | X, w) = \mathcal N (X^T w, 0) + \mathcal N (0, \sigma_n^2) = \mathcal N (X^T w, \sigma_n^2) $$
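As a small numeric illustration of this likelihood (the weight vector, inputs, and noise level below are assumptions for the example only):

import numpy as np
from scipy.stats import norm

sigma_n = 0.3                               # assumed noise standard deviation
w_demo = np.array([0.5, -1.0])              # assumed weight vector
X_demo = np.array([[1.0, 1.0, 1.0],         # first feature (a bias term)
                   [0.1, 0.5, 0.9]])        # second feature
y_demo = np.array([0.4, -0.1, -0.5])

# p(y | X, w) = N(X^T w, sigma_n^2), i.i.d. across examples
log_lik = norm.logpdf(y_demo, loc=X_demo.T.dot(w_demo), scale=sigma_n).sum()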

Predicting a new "y"

We need to find the posterior predictive; that is, the probability distribution of a new output $y^*$ given a new input $x^*$
$$ p(y^*|x^*, X, y) $$
I followed the derivation in http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node2.html. To find it we need $p(y^*|x^*, w)$ and $p(w|X,y)$, because the integral below has to be solved
$$ p(y^*|x^*, X, y) = \int p(y^*|x^*, w)\, p(w|X, y)\, dw $$

Bayes theorem

We assume a Gaussian prior over the weights, $p(w) = \mathcal N (0, \Sigma_p)$, and we have the likelihood of the data derived above, $p(y|X,w) = \mathcal N (X^T w, \sigma_n^2 I)$.

Taking this into account, we can find $p(w|X,y)$, which according to Bayes' theorem is
$$ p(w|X,y) = \frac{p(y|X,w)\, p(w)}{p(y|X)} $$

In the end, $p(w|X,y) = N(w|\mu_w,\Sigma_w) = N(\mu_w,\Sigma_w)$
where
$$ \begin{align} \mu_w &= \frac{1}{\sigma_n^2} \Sigma_w X y \\ \Sigma_w &= (\frac{1}{\sigma_n^2} X X^T + \Sigma_p^{-1})^{-1} \end{align} $$
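A quick numeric check of these expressions on synthetic linear data (the true weights, prior, and noise level are assumptions for this example): the posterior mean should land close to the weights that generated the data.

import numpy as np

np.random.seed(1)
sigma_n = 0.1
Sigma_p = np.eye(2)                                    # assumed prior covariance

w_true = np.array([1.5, -0.7])
X_demo = np.random.uniform(-1, 1, (2, 50))             # columns are the inputs
y_demo = X_demo.T.dot(w_true) + np.random.normal(0, sigma_n, 50)

Sigma_w = np.linalg.inv(X_demo.dot(X_demo.T) / sigma_n**2 + np.linalg.inv(Sigma_p))
mu_w = Sigma_w.dot(X_demo).dot(y_demo) / sigma_n**2    # should be close to w_true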

Solution

$$ p(y^* | x^*, X, y) = \mathcal N(\mu_w^T x^*,\; {x^*}^T \Sigma_w x^*) $$
I still don't understand the part of the derivation where $p(y^*|x^*,w)$ is reduced to $w^T x^*$.

Nonlinear model

By using a basis-function mapping $\phi(\cdot)$ for the inputs, Bayesian linear regression can also fit nonlinear curves. With $\Phi$ the matrix whose columns are $\phi(x^{(i)})$,
$$ p(w|X,y) = N(\mu_w,\Sigma_w) $$
$$ \begin{align} \mu_w &= \frac{1}{\sigma_n^2} \Sigma_w \Phi y \\ \Sigma_w &= (\frac{1}{\sigma_n^2} \Phi \Phi^T + \Sigma_p^{-1})^{-1} \end{align} $$
and
$$ p(y^* | x^*, X, y) = N(\mu_w^T \phi(x^*), \phi(x^*)^T \Sigma_w \phi(x^*)) $$
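For instance, these feature-space formulas can be applied directly to the x and y generated earlier with polynomial basis functions (Sigma_p and sigma_n below are assumed values, not the alpha and beta used later in the notebook):

import numpy as np

M_demo = 3
sigma_n = 0.3                                          # assumed noise standard deviation
Sigma_p = np.eye(M_demo + 1)                           # assumed prior covariance

# Columns of Phi are phi(x_n) = (1, x_n, ..., x_n^M)
Phi = np.vstack([np.asarray(x)**i for i in range(M_demo + 1)])

Sigma_w = np.linalg.inv(Phi.dot(Phi.T) / sigma_n**2 + np.linalg.inv(Sigma_p))
mu_w = Sigma_w.dot(Phi).dot(np.asarray(y)) / sigma_n**2

phi_star = np.array([0.33**i for i in range(M_demo + 1)])
pred_mean = mu_w.dot(phi_star)                         # mu_w^T phi(x*)
pred_var = phi_star.dot(Sigma_w).dot(phi_star)         # phi(x*)^T Sigma_w phi(x*)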

Common notation

With this in hand, the notation used in Bishop's book (Pattern Recognition and Machine Learning) is easier to understand. Writing the prior as $p(w) = \mathcal N(0, \alpha^{-1} I)$ (so $\Sigma_p = \alpha^{-1} I$) and the noise precision as $\beta = 1/\sigma_n^2$, the predictive distribution for a new input $x^*$ becomes $p(y^*|x^*, \mathbf x, \mathbf y) = \mathcal N(m(x^*), s^2(x^*))$ with
$$ \begin{align} S^{-1} &= \alpha I + \beta \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T \\ m(x^*) &= \beta\, \phi(x^*)^T S \sum_{n=1}^{N} \phi(x_n)\, y_n \\ s^2(x^*) &= \frac{1}{\beta} + \phi(x^*)^T S\, \phi(x^*) \end{align} $$
The extra $1/\beta$ term accounts for the observation noise on the new target.

This is what I implemented below, with the prior mean taken to be $m_0 = 0$
In [7]:
def phi(x, M):
    # Polynomial basis: phi(x) = (1, x, x^2, ..., x^M) as a column vector
    return np.array([[x**i for i in range(M+1)]]).T

def matrix_S(x_vec, alpha, beta, M):
    # Posterior covariance S, with S^-1 = alpha*I + beta * sum_n phi(x_n) phi(x_n)^T
    first_expr = alpha*np.eye(M+1)
    second_expr = 0
    for n in range(len(x_vec)):
        second_expr += phi(x_vec[n], M).dot(phi(x_vec[n],M).T)
    return np.linalg.inv(first_expr+beta*second_expr)

def post_mean(x_new, x_vec, target_vec, S, alpha, beta, M):
    # Predictive mean: beta * phi(x_new)^T S sum_n phi(x_n) y_n
    sum_vec_x = 0
    for i in range(len(x_vec)):
        sum_vec_x += phi(x_vec[i], M) * target_vec[i]
    return (beta * (phi(x_new, M).T.dot(S)).dot(sum_vec_x)).squeeze()

def post_variance(x_new, S, beta, M):
    # Predictive variance: 1/beta + phi(x_new)^T S phi(x_new)
    return (1.0/beta + phi(x_new, M).T.dot(S).dot(phi(x_new, M))).squeeze()

def posterior(x_new, x_vec, target_vec, S, alpha, beta, M):
    # Parameters (mean, variance) of the predictive distribution for a new input
    return post_mean(x_new, x_vec, target_vec, S, alpha, beta, M), \
           post_variance(x_new, S, beta, M)
In [8]:
a = phi(2.,2)
np.dot(a, a.T)
Out[8]:
array([[  1.,   2.,   4.],
       [  2.,   4.,   8.],
       [  4.,   8.,  16.]])

Fit and predict

In a scikit-learn-style interface, the "fit" step for this model would essentially just store the values of x and y in memory, because, unlike other kinds of learning algorithms, evaluating test data requires going over the whole training set again. As a consequence, "predict" is the costly step.
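A sketch of what such a fit/predict interface could look like, wrapping the functions defined above (the class name and structure are hypothetical, not scikit-learn's actual estimator API):

class BayesianPolyRegression:
    """Thin fit/predict wrapper around matrix_S and posterior defined above."""

    def __init__(self, M=9, alpha=5e-3, beta=11.1):
        self.M, self.alpha, self.beta = M, alpha, beta

    def fit(self, x_vec, target_vec):
        # "Fitting" only stores the training data
        self.x_vec, self.target_vec = x_vec, target_vec
        return self

    def predict(self, x_new):
        # All of the work happens here: the whole training set is revisited
        S = matrix_S(self.x_vec, self.alpha, self.beta, self.M)
        return posterior(x_new, self.x_vec, self.target_vec,
                         S, self.alpha, self.beta, self.M)

# Usage: BayesianPolyRegression().fit(x, y).predict(0.33)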
Predicting a value
In [9]:
alpha = 5*10**(-3)
beta = 11.1
M = 9

x_new = 0.33

S = matrix_S(x, alpha, beta, M)
mean, variance = posterior(x_new, x, y, S, alpha, beta, M)
mean, variance
Out[9]:
(array(0.7775175284977335), array(0.11953030339926511))
The posterior mean is close to the true (noise-free) value of the underlying function at x_new
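For comparison, computing that underlying sine value directly:

np.sin(2 * math.pi * x_new)   # about 0.876; the posterior mean 0.78 is well within one posterior standard deviation (about 0.35) of it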

Plotting a confidence region around the expected function

In [10]:
means = [post_mean(x_i, x, y, S, alpha, beta, M) for x_i in x]
variances = [post_variance(x_i, S, beta, M) for x_i in x]
In [11]:
SD = np.sqrt(variances)
upper = means + SD
lower = means - SD

plt.plot(x, y, 'bo')
plt.plot(x_real, y_real, 'g-')
plt.plot(x, means, 'r-')
plt.fill_between(x, upper, lower, color='pink')
Out[11]:
<matplotlib.collections.PolyCollection at 0x1119b8ba8>
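One further visualization that could be added here (not in the original notebook): drawing a few weight vectors from the posterior $N(m_N, S)$ and plotting the corresponding curves. The explicit posterior mean vector m_N below is built from the same quantities post_mean uses.

# Sample curves from the posterior over the weights, w ~ N(m_N, S)
sum_phi_t = sum(phi(x[n], M) * y[n] for n in range(len(x)))   # sum_n phi(x_n) y_n
m_N = (beta * S.dot(sum_phi_t)).squeeze()                     # posterior mean of w

w_samples = np.random.multivariate_normal(m_N, S, size=5)
xs = np.linspace(0, 1, 100)
for w_s in w_samples:
    plt.plot(xs, [w_s.dot(phi(xi, M)).squeeze() for xi in xs], alpha=0.5)
plt.plot(x, y, 'bo')
plt.plot(x_real, y_real, 'g-')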