Here I cover the Bayesian approach to linear regression and how Bayes' rule is used to build predictors.
Notation
Training set of i.i.d. examples drawn from some unknown distribution: $$ D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m} $$
Testing set of i.i.d. examples drawn from the same distribution as $D$: $$ T = \{(x^{(i)}_*, y^{(i)}_*)\}_{i=1}^{m} $$
We have to fit this model $$ \begin{align} y^{(i)} = w^T x^{(i)} + \epsilon^{(i)} \\ \epsilon^{(i)} \sim \mathcal N (0, \sigma_n^2) \quad \text{i.i.d.} \end{align} $$
or in matrix form $$ \mathbf y = \mathbf X^T \mathbf w + \epsilon $$
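As a quick illustration (this is not the dataset used below), here is a minimal sketch of drawing samples from this linear-Gaussian model, with an assumed weight vector and noise level:
import numpy as np

w_true = np.array([0.5, -1.0])                # assumed weights, for illustration only
sigma_n = 0.1                                 # assumed noise standard deviation
X = np.random.uniform(0, 1, size=(2, 20))     # each column is one input x^(i)
eps = np.random.normal(0, sigma_n, size=20)   # i.i.d. Gaussian noise
y_sim = X.T.dot(w_true) + eps                 # y = X^T w + eps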
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import math
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 200)
x_real = np.linspace(0, 1, 100, endpoint=True)   # dense grid for the true curve
y_real = np.sin(2 * math.pi * x_real)            # true underlying function sin(2*pi*x)
df_real = pd.DataFrame({'x': x_real, 'y': y_real})
plt.plot(x_real, y_real, 'g-')
x = np.linspace(0, 1, 10, endpoint=True)         # 10 training inputs
y = [np.sin(2 * math.pi * xi) + np.random.uniform(-0.3, 0.3) for xi in x]   # noisy targets
df = pd.DataFrame({'x': x, 'y': y})
plt.plot(x, y, 'bo')
Out[1]:
Parametric approach
In the parametric approach, we model $y$ as a deterministic function of an independent variable $x$ plus noise: $$y = f(x) + \epsilon$$
For example, for the model $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \epsilon$, the parameters $\theta_0$, $\theta_1$, and $\theta_2$ have to be found from the data.
3 parameters need to be specified!
In [2]:
def y_eval(x, w):
    # Evaluate the polynomial w[0] + w[1]*x + ... + w[M]*x^M at x
    return sum([w[i] * x**i for i in range(len(w))])
And this is the error function: half the sum of squared errors (SSE)
In [3]:
def E(w, x, y):
    # Half the sum of squared errors of the polynomial w on the data (x, y)
    return 1/2 * sum([(y_eval(x[n], w) - y[n])**2 for n in range(len(x))])
Fitting algorithm
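The weight update used below follows from differentiating $E$ with respect to each coefficient:
$$ \frac{\partial E}{\partial w_m} = \sum_{n=1}^{N} \left( \hat y^{(n)} - y^{(n)} \right) \left( x^{(n)} \right)^m $$
where $\hat y^{(n)}$ denotes y_eval(x[n], w), and gradient descent updates each weight as $w_m \leftarrow w_m - \alpha \, \partial E / \partial w_m$.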
In [4]:
def fitting(x, y, M, alpha, min_error, max_iterations):
    N = len(x)
    # Random initialization of the M+1 polynomial weights
    w = np.random.uniform(0, 1, M + 1)
    SSE = E(w, x, y)
    print(SSE)
    iterations = 0
    while SSE > min_error and iterations < max_iterations:
        for m in range(M + 1):
            # Batch gradient descent step on weight m, using all N training points
            w[m] = w[m] - alpha * sum([(y_eval(x[n], w) - y[n]) * x[n] ** m for n in range(N)])
        SSE = E(w, x, y)
        iterations += 1
    print(SSE)
    return w
w = fitting(x, y, 3, 0.01, 0.1, 200000)
y_est = y_eval(x, w)
In [5]:
plt.plot(x_real, y_real, 'g-')
plt.plot(x, y, 'bo')
plt.plot(x, y_est,'r-')
Out[5]:
In [6]:
x_new = 0.33
y_eval(x_new, w)
Out[6]:
Bayesian Linear Regression
From $$ \mathbf y = \mathbf X^T \mathbf w + \epsilon $$
We can obtain the probability distribution of $\mathbf y$ as the sum of the two independent terms on the right-hand side. The term $\mathbf X^T \mathbf w$ is the noise-free value of the function, so it can be treated as $\mathcal N (\mathbf X^T \mathbf w, 0)$, a Gaussian with variance $0$.
$$ P(y | X, w) = \mathcal N (X^T w, 0) + \mathcal N (0, \sigma_n^2) $$ $$ P(y | X, w) = \mathcal N (X^T w, \sigma_n^2) $$
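A quick numerical check of this (the constant standing in for $\mathbf X^T \mathbf w$ and the noise level here are just illustrative):
f_val = 1.7                                        # stand-in for the noise-free value X^T w
sigma_n = 0.2                                      # illustrative noise standard deviation
samples = f_val + np.random.normal(0, sigma_n, 100000)
samples.mean(), samples.std()                      # close to (1.7, 0.2), i.e. N(X^T w, sigma_n^2)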
Predicting a new "y"
We need to find the predictive distribution; that is, the probability distribution of $y^*$ for a new input $x^*$: $$ p(y^*|x^*, X, y) $$
I followed the derivation in http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node2.html. To find it we need $p(y^*|x^*, w)$ and $p(w|X,y)$, because the following integral needs to be solved: $$ p(y^*|x^*, X, y) = \int p(y^*|x^*, w) \, p(w|X, y) \, dw $$
Nonlinear model
By using a mapping function $\phi$ for the inputs, Bayesian linear regression can work well with nonlinear curves. The posterior over the weights is $$ p(w|X,y) = \mathcal N(\mu_w,\Sigma_w) $$
$$ \begin{align} \mu_w = \frac{1}{\sigma_n^2} \Sigma_w \Phi y\\ \Sigma_w = \left(\frac{1}{\sigma_n^2} \Phi \Phi^T + \Sigma_p^{-1}\right)^{-1} \end{align} $$
and
$$ p(y^* | x^*, X, y) = \mathcal N(\mu_w^T \phi(x^*),\ \sigma_n^2 + \phi(x^*)^T \Sigma_w \phi(x^*)) $$
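Before the implementation below (which uses an $\alpha$, $\beta$ parameterization), here is a direct numpy sketch of these formulas on the training data generated above, assuming polynomial features, an isotropic prior $\Sigma_p = \alpha^{-1} I$, and noise variance $\sigma_n^2 = 1/\beta$; it should agree with the post_mean and post_variance functions defined next.
M = 9
alpha, beta = 5*10**(-3), 11.1                                           # same values as used below
Phi = np.array([[xi**i for xi in x] for i in range(M + 1)])              # (M+1, N) matrix whose columns are phi(x_n)
Sigma_w = np.linalg.inv(beta * Phi.dot(Phi.T) + alpha * np.eye(M + 1))   # posterior covariance of the weights
mu_w = beta * Sigma_w.dot(Phi).dot(y)                                    # posterior mean of the weights
phi_star = np.array([0.33**i for i in range(M + 1)])                     # features of a new input x* = 0.33
mean_star = mu_w.dot(phi_star)
var_star = 1.0/beta + phi_star.dot(Sigma_w).dot(phi_star)                # predictive variance includes the noise term
mean_star, var_star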
In [7]:
def phi(x, M):
    # Polynomial feature vector (1, x, x^2, ..., x^M) as a column vector
    return np.array([[x**i for i in range(M+1)]]).T

def matrix_S(x_vec, alpha, beta, M):
    # Posterior covariance S = (alpha*I + beta * sum_n phi(x_n) phi(x_n)^T)^{-1}
    first_expr = alpha * np.eye(M+1)
    second_expr = 0
    for n in range(len(x_vec)):
        second_expr += phi(x_vec[n], M).dot(phi(x_vec[n], M).T)
    return np.linalg.inv(first_expr + beta * second_expr)

def post_mean(x_new, x_vec, target_vec, S, alpha, beta, M):
    # Predictive mean: beta * phi(x_new)^T S sum_n phi(x_n) y_n
    sum_vec_x = 0
    for i in range(len(x_vec)):
        sum_vec_x += phi(x_vec[i], M) * target_vec[i]
    return (beta * (phi(x_new, M).T.dot(S)).dot(sum_vec_x)).squeeze()

def post_variance(x_new, S, beta, M):
    # Predictive variance: noise variance 1/beta plus model uncertainty
    return (1.0/beta + phi(x_new, M).T.dot(S).dot(phi(x_new, M))).squeeze()

def posterior(x_new, x_vec, target_vec, S, alpha, beta, M):
    return post_mean(x_new, x_vec, target_vec, S, alpha, beta, M), \
           post_variance(x_new, S, beta, M)
In [8]:
a = phi(2.,2)
np.dot(a, a.T)
Out[8]:
Predicting a value
In [9]:
alpha = 5*10**(-3)   # prior precision over the weights
beta = 11.1          # noise precision (1 / sigma_n^2)
M = 9                # polynomial degree
x_new = 0.33
S = matrix_S(x, alpha, beta, M)
mean, variance = posterior(x_new, x, y, S, alpha, beta, M)
mean, variance
Out[9]:
The mean is approximately the true value of $y$ at x_new.
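For comparison, the noise-free value of the underlying function at this point:
np.sin(2 * math.pi * x_new)    # true value sin(2*pi*0.33)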
In [10]:
means = [post_mean(x_i, x, y, S, alpha, beta, M) for x_i in x]
variances = [post_variance(x_i, S, beta, M) for x_i in x]
In [11]:
SD = np.sqrt(variances)
upper = means + SD
lower = means - SD
plt.plot(x, y, 'bo')
plt.plot(x_real, y_real, 'g-')
plt.plot(x, means, 'r-')
plt.fill_between(x, upper, lower, color='pink')
Out[11]:
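A minor variant of the cell above: evaluating the predictive mean and the one-standard-deviation band on the dense grid x_real gives a smoother picture of the fit.
means_grid = np.array([post_mean(xi, x, y, S, alpha, beta, M) for xi in x_real])
SD_grid = np.sqrt([post_variance(xi, S, beta, M) for xi in x_real])
plt.plot(x, y, 'bo')
plt.plot(x_real, y_real, 'g-')
plt.plot(x_real, means_grid, 'r-')
plt.fill_between(x_real, means_grid + SD_grid, means_grid - SD_grid, color='pink')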