Here I cover the Bayesian approach to linear regression and how Bayes' rule is used to build predictors.
Notation
Training set of i.i.d. examples drawn from some unknown distribution: D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}
Testing set of i.i.d. examples drawn from the same distribution as D: T = \{(x^{(i)}_*, y^{(i)}_*)\}_{i=1}^{m}
We have to fit this model \begin{align} y^{(i)} = w^T x^{(i)} + \epsilon^{(i)} \\ \epsilon^{(i)} \sim \mathcal N (0, \sigma_n^2) \quad \text{i.i.d.} \end{align}
or in matrix form \mathbf y = \mathbf X^T \mathbf w + \epsilon
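As a quick illustration of this generative model, here is a minimal sketch that simulates data from it; the weight vector w_true, the noise level sigma_n, and the input distribution are made-up illustrative choices, not the dataset used later in the notebook.

# Simulate y = w^T x + eps with eps ~ N(0, sigma_n^2).
# w_true and sigma_n are hypothetical values for illustration only.
import numpy as np

w_true = np.array([0.5, -1.2])                  # hypothetical true weights
sigma_n = 0.1                                   # hypothetical noise std dev
X_sim = np.random.uniform(-1, 1, size=(20, 2))  # 20 examples, 2 features
eps = np.random.normal(0, sigma_n, size=20)     # i.i.d. Gaussian noise
y_sim = X_sim.dot(w_true) + eps                 # noisy targets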
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import math
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 200)
x_real = np.linspace(0, 1, 100, endpoint=True)
y_real = np.sin(2 * math.pi * x_real)
df_real = pd.DataFrame({'x':x_real, 'y':y_real})
plt.plot(x_real,y_real,'g-')
x = np.linspace(0, 1, 10, endpoint=True)
y = [np.sin(2 * math.pi * xi)+ np.random.uniform(-0.3,0.3) for xi in x]
df = pd.DataFrame({'x':x, 'y':y})
plt.plot(x,y,'bo')
Out[1]:
In [2]:
def y_eval(x, w):
    # Evaluate the degree-(len(w)-1) polynomial with coefficients w at x
    return sum([w[i] * x**i for i in range(len(w))])
And this is the sum of squared errors (SSE), with the usual 1/2 factor so that the gradient is simpler:
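Written out explicitly for the degree-M polynomial y(x, \mathbf w) = \sum_{i=0}^{M} w_i x^i and N training points (this is just the quantity computed by the function below): E(\mathbf w) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf w) - y_n \right)^2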
In [3]:
def E(w, x, y):
    # Half the sum of squared errors between the polynomial fit and the targets
    return 1/2 * sum([(y_eval(x[n], w) - y[n])**2 for n in range(len(x))])
Fitting algorithm
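The coefficients are updated with gradient descent on E(\mathbf w). Differentiating with respect to w_m gives \frac{\partial E}{\partial w_m} = \sum_{n=1}^{N} \left( y(x_n, \mathbf w) - y_n \right) x_n^m, so each pass of the loop below applies the update w_m \leftarrow w_m - \alpha \, \frac{\partial E}{\partial w_m} with learning rate \alpha.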
In [4]:
def fitting(x, y, M, alpha, min_error, max_iterations):
    N = len(x)
    w = np.random.uniform(0, 1, M + 1)   # random initial coefficients
    SSE = E(w, x, y)
    print(SSE)                           # initial error
    iterations = 0
    while SSE > min_error and iterations < max_iterations:
        for m in range(M + 1):
            # Batch gradient descent update for coefficient m
            w[m] = w[m] - alpha * sum([(y_eval(x[n], w) - y[n]) * x[n] ** m for n in range(N)])
        SSE = E(w, x, y)
        iterations += 1
    print(SSE)                           # final error
    return w
w = fitting(x, y, 3, 0.01, 0.1, 200000)
y_est = y_eval(x, w)
In [5]:
plt.plot(x_real, y_real, 'g-')
plt.plot(x, y, 'bo')
plt.plot(x, y_est,'r-')
Out[5]:
In [6]:
x_new = 0.33
y_eval(x_new, w)
Out[6]:
Bayesian Linear Regression
From \mathbf y = \mathbf X^T \mathbf w + \epsilon
We can obtain the probability distribution of \mathbf y by adding the two independent terms on the right-hand side, treating \mathbf X^T \mathbf w as \mathcal N (X^T w, 0): given X and w it is the noise-free value of the function, so it has variance 0.
P(y | X, w) = \mathcal N (X^T w, 0) + \mathcal N (0, \sigma_n^2)
P(y | X, w) = \mathcal N (X^T w, \sigma_n^2)
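Since the noise terms are i.i.d., the likelihood of the whole training set factorizes (a standard consequence of the model above): p(\mathbf y | X, \mathbf w) = \prod_{i=1}^{m} \mathcal N \left( y^{(i)} \mid w^T x^{(i)}, \sigma_n^2 \right)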
Predicting a new "y"
We need to find the predictive distribution; that is, the probability distribution of y^* for a new input x^*: p(y^*|x^*, X, y)
I followed the derivation in http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node2.html. To find it we need the likelihood for the new point, p(y^*|x^*, w), and the posterior over the weights, p(w|X,y), because the following integral needs to be solved:
p(y^* | x^*, X, y) = \int p(y^* | x^*, w) \, p(w | X, y) \, dw
Both factors are Gaussian, so the integral has a closed-form Gaussian solution; the resulting formulas are given in the next section in terms of basis functions.
Nonlinear model
By using a mapping function \phi(\cdot) for the inputs, Bayesian linear regression can work well with nonlinear curves. With \Phi the matrix whose columns are the mapped training inputs \phi(x^{(i)}) and \Sigma_p the prior covariance of w, the posterior over the weights is
p(w|X,y) = N(\mu_w,\Sigma_w)
\begin{align} \mu_w = \frac{1}{\sigma_n^2} \Sigma_w \Phi y\\ \Sigma_w = (\frac{1}{\sigma_n^2} \Phi \Phi^T + \Sigma_p^{-1})^{-1} \end{align}
and
p(y^* | x^*, X, y) = N(\mu_w^T \phi(x^*), \phi(x^*)^T \Sigma_w \phi(x^*))
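To connect these formulas with the code below: the code corresponds to an isotropic prior \Sigma_p = \alpha^{-1} I and noise precision \beta = 1/\sigma_n^2, under which \Sigma_w is exactly the matrix S computed there; the code's predictive variance also adds the noise variance 1/\beta because it describes the noisy observation y^* rather than the noise-free function value. Here is a minimal sketch that implements \mu_w and \Sigma_w directly with the polynomial basis \phi(x) = (1, x, \ldots, x^M)^T; the helper weight_posterior is my own, not part of the notebook.

# Minimal sketch of the weight posterior N(mu_w, Sigma_w), assuming the polynomial
# basis phi(x) = (1, x, ..., x^M)^T, Sigma_p = (1/alpha) * I and sigma_n^2 = 1/beta.
# weight_posterior is a hypothetical helper for illustration.
import numpy as np

def weight_posterior(x_vec, target_vec, alpha, beta, M):
    # Design matrix Phi whose columns are phi(x_n)
    Phi = np.array([[x_n**i for x_n in x_vec] for i in range(M + 1)])
    Sigma_w = np.linalg.inv(beta * Phi.dot(Phi.T) + alpha * np.eye(M + 1))
    mu_w = beta * Sigma_w.dot(Phi).dot(np.array(target_vec))  # posterior mean of w
    return mu_w, Sigma_w

With these, the predictive mean at x^* is \mu_w^T \phi(x^*), matching post_mean below.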
In [7]:
def phi(x, M):
    # Polynomial basis function phi(x) = (1, x, ..., x^M)^T as a column vector
    return np.array([[x**i for i in range(M+1)]]).T

def matrix_S(x_vec, alpha, beta, M):
    # Posterior covariance S = (alpha*I + beta * sum_n phi(x_n) phi(x_n)^T)^(-1)
    first_expr = alpha*np.eye(M+1)
    second_expr = 0
    for n in range(len(x_vec)):
        second_expr += phi(x_vec[n], M).dot(phi(x_vec[n], M).T)
    return np.linalg.inv(first_expr + beta*second_expr)

def post_mean(x_new, x_vec, target_vec, S, alpha, beta, M):
    # Predictive mean: beta * phi(x_new)^T S sum_n phi(x_n) y_n
    sum_vec_x = 0
    for i in range(len(x_vec)):
        sum_vec_x += phi(x_vec[i], M) * target_vec[i]
    return (beta * (phi(x_new, M).T.dot(S)).dot(sum_vec_x)).squeeze()

def post_variance(x_new, S, beta, M):
    # Predictive variance: observation noise 1/beta plus phi(x_new)^T S phi(x_new)
    return (1.0/beta + phi(x_new, M).T.dot(S).dot(phi(x_new, M))).squeeze()

def posterior(x_new, x_vec, target_vec, S, alpha, beta, M):
    return post_mean(x_new, x_vec, target_vec, S, alpha, beta, M), \
           post_variance(x_new, S, beta, M)
In [8]:
a = phi(2.,2)
np.dot(a, a.T)
Out[8]:
Predicting a value
In [9]:
alpha = 5*10**(-3)
beta = 11.1
M = 9
x_new = 0.33
S = matrix_S(x, alpha, beta, M)
mean, variance = posterior(x_new, x, y, S, alpha, beta, M)
mean, variance
Out[9]:
The posterior mean is approximately the true value of y at x_new
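For comparison (this line is not in the original notebook), the noise-free target at x_new, reusing the np and math imports from the first cell:

np.sin(2 * math.pi * x_new)   # roughly 0.876, close to the posterior mean above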
In [10]:
means = [post_mean(x_i, x, y, S, alpha, beta, M) for x_i in x]
variances = [post_variance(x_i, S, beta, M) for x_i in x]
In [11]:
SD = np.sqrt(variances)
upper = means + SD
lower = means - SD
plt.plot(x, y, 'bo')
plt.plot(x_real, y_real, 'g-')
plt.plot(x, means, 'r-')
plt.fill_between(x, upper, lower, color='pink')
Out[11]:
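As an optional follow-up, the same posterior functions can be evaluated on the dense grid x_real to get a smoother mean curve and uncertainty band; this is a sketch under the same alpha, beta, M, and S as above, not one of the original cells.

# Predictive mean and +/- one standard deviation band on the dense grid x_real
means_dense = np.array([post_mean(x_i, x, y, S, alpha, beta, M) for x_i in x_real])
sd_dense = np.sqrt([post_variance(x_i, S, beta, M) for x_i in x_real])

plt.plot(x, y, 'bo')
plt.plot(x_real, y_real, 'g-')
plt.plot(x_real, means_dense, 'r-')
plt.fill_between(x_real, means_dense + sd_dense, means_dense - sd_dense, color='pink')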