While learning some ML approaches based on probability theory, I wrote these notes on Maximum A Posteriori (MAP) estimation.
Maximum A Posteriori (MAP)
First, some definitions:
- Hypothesis: a statement about the parameters of a probability distribution, about a model, or about the shape of a probability distribution.
- i.i.d.: the observations are Independent and Identically Distributed, i.e., the probability distribution of the samples $d$ stays the same over time (the stationarity assumption). This means:
$$P(d^{(j)} | d^{(j-1)}, d^{(j-2)}, \ldots) = P(d^{(j)})$$
Moreover, each sample has the same prior probability:
$$P(d^{(j)}) = P(d^{(j-1)}) = P(d^{(j-2)}) = \ldots$$
- Likelihood: the probability of observing $N_d$ values $d^{(1)}, d^{(2)}, d^{(3)}, \ldots, d^{(N_d)}$ of a random variable given a hypothesis $h_\theta$ (normally these observations are assumed i.i.d.); see the sketch below.
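To make the likelihood definition concrete, here is a minimal sketch in Python, assuming i.i.d. binary observations and a hypothetical parameter theta for the probability of one of the two outcomes:

import numpy as np

def likelihood_iid(observations, theta):
    # under the i.i.d. assumption, the likelihood is the product
    # of the per-observation probabilities
    probs = [theta if obs == 1 else 1.0 - theta for obs in observations]
    return np.prod(probs)

# four i.i.d. binary observations under theta = 0.5: 0.5 ** 4 = 0.0625
print(likelihood_iid([1, 1, 1, 0], 0.5))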
The main concepts here are data and the hypothesis.
We call the prediction based on the most probable hypothesis the MAP prediction. The following example shows how such a prediction is made.
Example 1:
We have candies of two flavors: cherry and lime. Candies are found in wrappers of the same color regardless of the flavor. These candies are sold in bags of 10, and a bag can be one of 5 types:
- Type 1: 100% cherry
- Type 2: 75% cherry and 25% lime
- Type 3: 50% cherry and 50% lime
- Type 4: 25% cherry and 75% lime
- Type 5: 100% lime
$$ P(\text{The bag's type is 1}) = 0.1\\ P(\text{The bag's type is 2}) = 0.2\\ P(\text{The bag's type is 3}) = 0.4\\ P(\text{The bag's type is 4}) = 0.2\\ P(\text{The bag's type is 5}) = 0.1\\ $$
If we buy a bag, how can we know what type it is?
Observations are i.i.d.
Solution:
We could declare that it is type 3 without opening any candy, but the probability that this is the correct type is only 0.4. This probability can increase or decrease as we open candies. One thing we can do to solve this problem is to compute the probability that the bag is of type $i$ while we open candies from the bag. We call this the maximum a posteriori (MAP) prediction.
We define the random variable $H$ (for hypothesis) denoting the bag's type, so $dom(H) = \{h_1, \ldots, h_5\}$.
According to the statement, the a priori distribution over $H$ is $[0.1, 0.2, 0.4, 0.2, 0.1]$.
Then we define the random variable $D$ denoting the flavor of a candy, so $dom(D) = \{cherry, lime\}$; for the simulations below we draw candies uniformly at random.
$d^{(j)}$ will be the value observed at time $j$, and $d$ the value of an observed candy.
From the statement we have:
$$ P(d = lime | h_1) = 0\\ P(d = lime | h_2) = 0.25\\ P(d = lime | h_3) = 0.50\\ P(d = lime | h_4) = 0.75\\ P(d = lime | h_5) = 1.0 $$
Given that the data is i.i.d., then $P(d = lime | h_i) = P(d^{(1)} = lime | h_i) = P(d^{(2)} = lime | h_i) = \ldots$.
Then we already have $P(H)$ and $P(D | H)$.
We want to find the probability of each hypothesis; for example, the probability that the bag's type is 1 after opening a candy and seeing that it is lime:
$$P(H = h_1 | d^{(1)} = lime)$$
We can use Bayesian learning to compute the probability of each hypothesis given the flavor of the first candy.
$$h_{MAP} = \arg\max_{h_i} P(h_i | d)$$

$$ P(h_i | d) = \frac{P(d | h_i)P(h_i)}{P(d)} = \alpha P(d | h_i)P(h_i) $$

where $\alpha = 1 / P(d)$ is a normalization constant.
Furthermore, since the observations are i.i.d., the likelihood of the data given a hypothesis factorizes as:
$$ P(d | h_i) = \prod\limits_{j} P(d^{(j)} | h_i) $$
We wouldn't need to care about $\alpha$ if we only wanted to find the most probable hypothesis (normalization doesn't change the maximum), but because we want to reach a minimum probability of 0.6, we'll apply normalization.
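For instance, after observing a single lime candy, the normalization constant is the total probability of the evidence:

$$ P(d = lime) = \sum\limits_{i} P(lime | h_i)P(h_i) = (0 \times 0.1) + (0.25 \times 0.2) + (0.5 \times 0.4) + (0.75 \times 0.2) + (1.0 \times 0.1) = 0.5 $$

so that, for example, $P(h_3 | d^{(1)} = lime) = (0.5 \times 0.4) / 0.5 = 0.4$.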
Suppose we take a candy.
In [1]:
import numpy as np

D = ['cherry', 'lime']
# 0 is cherry and 1 is lime
d = D[np.random.randint(2)]
print(d)
In [2]:
def likelihood(d, h):
    # P(D = d | H = h): flavor probabilities for each bag type
    if h == 0:
        return 1.0 if d == 'cherry' else 0.0
    elif h == 1:
        return 0.75 if d == 'cherry' else 0.25
    elif h == 2:
        return 0.50 if d == 'cherry' else 0.50
    elif h == 3:
        return 0.25 if d == 'cherry' else 0.75
    elif h == 4:
        return 0.0 if d == 'cherry' else 1.0

# a priori distribution over H
h_apriori = [0.1, 0.2, 0.4, 0.2, 0.1]
In [3]:
for h in range(5):
    print('P(h_' + str(h + 1) + ') = ' + str(h_apriori[h]))
In [4]:
for h in range(5):
    # unnormalized posterior: likelihood times prior
    aposteriori = likelihood(d, h) * h_apriori[h]
    print('P(h_' + str(h + 1) + ' | d) = ' + str(aposteriori))
Now suppose we take 10 candies and all of them turn out to be lime:
In [5]:
d = ['lime' for i in range(10)]
print(d)
With this evidence, we need to compute the likelihood of the data; the a posteriori probabilities are updated each time we get new evidence.
In [6]:
array_aposteriori = []
for i, _ in enumerate(d):
    array_distrib = []
    for h in range(5):
        # likelihood of the first i + 1 observations under hypothesis h
        likelihood_ = 1
        for j in range(i + 1):
            likelihood_ *= likelihood(d[j], h)
        aposteriori = likelihood_ * h_apriori[h]
        array_distrib.append(aposteriori)
    array_aposteriori.append(array_distrib)
print(array_aposteriori)
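As a side note, recomputing the whole product for every prefix of the data makes this loop quadratic in the number of candies. The same unnormalized posteriors can be computed incrementally; a minimal sketch, reusing the likelihood function and h_apriori defined above:

# multiply each hypothesis' running unnormalized posterior
# by the likelihood of the newest candy only
running = list(h_apriori)
array_aposteriori_incr = []
for candy in d:
    running = [running[h] * likelihood(candy, h) for h in range(5)]
    array_aposteriori_incr.append(list(running))
print(array_aposteriori_incr)  # matches array_aposteriori (up to floating-point rounding)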
Normalizing
In [7]:
array_aposteriori_norm = []
for array_distrib in array_aposteriori:
    # alpha is the normalization constant 1 / P(d)
    alpha = 1.0 / sum(array_distrib)
    array_aposteriori_norm.append([p * alpha for p in array_distrib])
print(array_aposteriori_norm)
Plotting
In [8]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab

pylab.rcParams['figure.figsize'] = (7, 7)
# prepending the a priori distribution
array_rows = [h_apriori] + array_aposteriori_norm
df = pd.DataFrame(array_rows, columns=['h1: 100% cherry', 'h2: 75% cherry and 25% lime',
                                       'h3: 50% cherry and 50% lime',
                                       'h4: 25% cherry and 75% lime', 'h5: 100% lime'])
print(df)
df.plot()
Out[8]: [plot of the posterior probability of each hypothesis as the lime candies are observed]
This plot shows the learning process as the evidence accumulates.
Now, being more realistic, let's take 10 random candies.
In [9]:
# draw 10 candies uniformly at random
d = [D[v] for v in np.random.randint(2, size=10)]
In [10]:
def get_arrays_aposteriori(d):
    # unnormalized posterior over hypotheses after each observation
    array_aposteriori = []
    for i, _ in enumerate(d):
        array_distrib = []
        for h in range(5):
            likelihood_ = 1
            for j in range(i + 1):
                likelihood_ *= likelihood(d[j], h)
            aposteriori = likelihood_ * h_apriori[h]
            array_distrib.append(aposteriori)
        array_aposteriori.append(array_distrib)
    # normalization
    array_aposteriori_norm = []
    for array_distrib in array_aposteriori:
        alpha = 1.0 / sum(array_distrib)
        array_aposteriori_norm.append([p * alpha for p in array_distrib])
    return array_aposteriori_norm

array_aposteriori_norm = get_arrays_aposteriori(d)
# prepending the a priori distribution
array_rows = [h_apriori] + array_aposteriori_norm
df = pd.DataFrame(array_rows, columns=['h1: 100% cherry', 'h2: 75% cherry and 25% lime',
                                       'h3: 50% cherry and 50% lime',
                                       'h4: 25% cherry and 75% lime', 'h5: 100% lime'])
print(df)
df.plot()
Out[10]: [plot of the posterior probability of each hypothesis as the random candies are observed]
In [11]:
print(d)
Something additional we could do is predict the flavor of the next candy.
$$ P(d^{(j+1)} | d^{(j)}, h_i) $$
But given that we don't care which type of bag it came from, we remove the hypothesis by marginalizing:
$$ P(d^{(j+1)} | d^{(j)}) = \sum\limits_{i} P(d^{(j+1)} | d^{(j)}, h_i) P(h_i | d^{(j)})\\ $$
Then, given that the data is i.i.d., the event $d^{(j+1)}$ doesn't depend on the previous observation (or observations) $d^{(j)}$; it depends only on the hypothesis:
$$ P(d^{(j+1)} | d^{(j)}) = \sum\limits_{i} P(d^{(j+1)} | h_i) P(h_i | d^{(j)}) $$
For example, after observing 5 limes:
$$ P(lime | \text{5 limes}) =\\ P(lime | h_1)P(h_1 | \text{5 limes}) + P(lime | h_2)P(h_2 | \text{5 limes}) + P(lime | h_3)P(h_3 | \text{5 limes})\\ + P(lime | h_4)P(h_4 | \text{5 limes}) + P(lime | h_5)P(h_5 | \text{5 limes})\\ = (0 \times 0) + (0.25 \times 0.00122) + (0.5 \times 0.07805) + (0.75 \times 0.29634) + (1.0 \times 0.62439)\\ = 0.88598 $$
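A quick sanity check of this arithmetic, as a throwaway snippet reusing the likelihood function and h_apriori from above:

# unnormalized posterior after observing 5 limes, then normalize
post = [likelihood('lime', h) ** 5 * h_apriori[h] for h in range(5)]
Z = sum(post)
post = [p / Z for p in post]
# predictive probability that the next candy is lime
print(sum(likelihood('lime', h) * post[h] for h in range(5)))  # ~0.88598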
Now, using the previous example, we can compute this prediction from each stored posterior.
In [12]:
d_ = 'lime'
prob_prediccion_array = []
for array_distrib in array_rows:
    # sum over hypotheses: P(lime | h) * P(h | evidence so far)
    suma = 0
    for h, aposteriori in enumerate(array_distrib):
        suma += likelihood(d_, h) * aposteriori
    prob_prediccion_array.append(suma)
df = pd.Series(prob_prediccion_array)
print(df)
df.plot()
Out[12]: [plot of the predicted probability that the next candy is lime after each observation]
And this happens because of the particular candies we drew:
In [13]:
print(d)
We see how the probability of the next candy being lime changes with the evidence.
Bibliography
- Russell, S. J., & Norvig, P. (2009). Artificial Intelligence: A Modern Approach (3rd ed.). Prentice Hall.
- Montgomery, D. C., & Runger, G. C. (2010). Applied Statistics and Probability for Engineers. John Wiley & Sons.