Ensemble Learning Explained

Part 1: Ensembling

Ensemble learning, or ensembling, is the process of combining several predictive models to produce a combined model that is more accurate than any of the individual models.
  • Regression: Calculate the average of the predictions.
  • Classification: Use the most common prediction or take the average of the predicted probabilities.
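The two combination rules above can be sketched with NumPy. All of the prediction arrays below are hypothetical values made up for illustration, and the 0.5 threshold for the averaged probabilities is an assumption for a binary problem.

```python
import numpy as np

# Hypothetical predictions from three regression models for four observations
reg_preds = np.array([
    [10.0, 20.0, 30.0, 40.0],
    [12.0, 18.0, 33.0, 41.0],
    [11.0, 22.0, 27.0, 39.0],
])
# Regression: average the predictions across models (axis 0)
reg_ensemble = reg_preds.mean(axis=0)

# Hypothetical class labels (0 or 1) predicted by three classifiers
class_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
])
# Classification, option 1: most common prediction for each observation
votes = np.array([np.bincount(col, minlength=2).argmax() for col in class_preds.T])

# Hypothetical predicted probabilities of class 1 from the same classifiers
proba_preds = np.array([
    [0.2, 0.9, 0.6, 0.1],
    [0.4, 0.8, 0.3, 0.2],
    [0.7, 0.7, 0.9, 0.3],
])
# Classification, option 2: average the probabilities, then threshold at 0.5
soft_votes = (proba_preds.mean(axis=0) >= 0.5).astype(int)

print(reg_ensemble, votes, soft_votes)
```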

For an ensemble to work, the models must have the following characteristics:
  • Accurate models: Each one performs better than a naive baseline.
  • Independent models: Their predictions are generated using different processes.
General idea: If we have a collection of independent models, the errors that one model makes are unlikely to be repeated by the other models. As a result, the individual mistakes tend to cancel out when the predictions are averaged.
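A small simulation can illustrate this idea. It is only a sketch under strong assumptions: the "models" are simulated as the true value plus independent Gaussian noise, and the number of models (25) and the noise level are arbitrary choices. Real models are rarely fully independent, so the gain in practice is usually smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_obs = 25, 10_000

# Assumed true value for every observation (zero, for simplicity)
truth = np.zeros(n_obs)

# Each row is one simulated model: truth plus independent noise with std 1
preds = truth + rng.normal(loc=0.0, scale=1.0, size=(n_models, n_obs))

# Error of one model alone vs. error of the averaged predictions
rmse_single = np.sqrt(np.mean((preds[0] - truth) ** 2))
rmse_ensemble = np.sqrt(np.mean((preds.mean(axis=0) - truth) ** 2))

print(rmse_single, rmse_ensemble)
```

With independent errors, the RMSE of the average shrinks roughly by a factor of the square root of the number of models, so here it comes out close to 1/5 of the single-model RMSE.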
There are two basic ensembling methods:
  • Manually join the individual models
  • Use a model that

Part 2: Manual Ensembling

What makes a good manual ensemble?
  • Different types of models
  • Different combinations of features
  • Different parameter values
[Figure: Machine learning flowchart]
Source: Machine learning flowchart created by the winner of Kaggle's CrowdFlower competition
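As a minimal sketch of a manual ensemble, the code below averages regressors of different types and with different parameter values. The synthetic data, the choice of models, and their parameter values are all assumptions made for illustration; using different feature combinations per model is left out to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 2))
y = 3 * X[:, 0] + 5 * np.sin(X[:, 1]) + rng.normal(0, 1, size=300)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# Different model types and different parameter values
models = [
    LinearRegression(),
    KNeighborsRegressor(n_neighbors=5),
    KNeighborsRegressor(n_neighbors=15),
    DecisionTreeRegressor(max_depth=4, random_state=0),
]

# Fit each model and collect its test predictions (one row per model)
all_preds = np.array([m.fit(X_train, y_train).predict(X_test) for m in models])

# Manual ensemble: average the predictions of the individual models
ensemble_pred = all_preds.mean(axis=0)

def rmse(y_true, y_hat):
    """Root mean squared error between two arrays."""
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

individual_rmses = [rmse(y_test, p) for p in all_preds]
ensemble_rmse = rmse(y_test, ensemble_pred)
for m, r in zip(models, individual_rmses):
    print(type(m).__name__, round(r, 3))
print("ensemble", round(ensemble_rmse, 3))
```

The averaged prediction can never be worse than the worst individual model, but whether it beats the best one depends on how accurate and how independent the models are.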

Part 3: Bagging

Decision trees tend not to have the best predictive accuracy. This is partly due to high variance, which means that different splits of the training data can produce very different trees.
Bagging is a general-purpose procedure that reduces the variance of a machine learning method, and it is particularly useful for decision trees. Bagging is short for bootstrap aggregation: the aggregation of models trained on bootstrap samples.
A bootstrap sample is a random sample drawn with replacement.
In [1]:
import numpy as np
# set a seed for reproducibility
np.random.seed(1)

# create an array of 1 through 20
nums = np.arange(1, 21)
print(nums)

# sample that array 20 times with replacement
print(np.random.choice(a=nums, size=20, replace=True))
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
[ 6 12 13  9 10 12  6 16  1 17  2 13  8 14  7 19  6 19 12 11]
How does bagging work for decision trees?
  1. Grow B trees using B bootstrap samples of the training data.
  2. Train each tree on its bootstrap sample and make predictions.
  3. Combine the predictions (average for regression, most common prediction for classification).
Notes:
  • Each bootstrap sample should be the same size as the original training set.
  • B should be large enough that the error appears to have stabilized.
  • The trees are grown deep, so that each one has low bias and high variance.
Bagging increases predictive accuracy by reducing the variance, similar to how cross-validation reduces the variance associated with a single train/test split by splitting several times and averaging the results.

Manually implementing "bagged decision trees" (B = 10)

In [2]:
# read in and prepare the vehicle training data
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_train.csv'
train = pd.read_csv(url)
train['vtype'] = train.vtype.map({'car':0, 'truck':1})
train
Out[2]:

price year miles doors vtype
0 22000 2012 13000 2 0
1 14000 2010 30000 2 0
2 13000 2010 73500 4 0
3 9500 2009 78000 4 0
4 9000 2007 47000 4 0
5 4000 2006 124000 2 0
6 3000 2004 177000 4 0
7 2000 2004 209000 4 1
8 3000 2003 138000 2 0
9 1900 2003 160000 4 0
10 2500 2003 190000 2 1
11 5000 2001 62000 4 0
12 1800 1999 163000 2 1
13 1300 1997 138000 4 0

Creating 10 "bootstrap samples" that will be used to select rows of the DataFrame

In [3]:
# set a seed for reproducibility
np.random.seed(123)

samples = [np.random.choice(a=14, size=14, replace=True) for _ in range(1, 11)]
samples
Out[3]:
[array([13,  2, 12,  2,  6,  1,  3, 10, 11,  9,  6,  1,  0,  1]),
 array([ 9,  0,  0,  9,  3, 13,  4,  0,  0,  4,  1,  7,  3,  2]),
 array([ 4,  7,  2,  4,  8, 13,  0,  7,  9,  3, 12, 12,  4,  6]),
 array([ 1,  5,  6, 11,  2,  1, 12,  8,  3, 10,  5,  0, 11,  2]),
 array([10, 10,  6, 13,  2,  4, 11, 11, 13, 12,  4,  6, 13,  3]),
 array([10,  0,  6,  4,  7, 11,  6,  7,  1, 11, 10,  5,  7,  9]),
 array([ 2,  4,  8,  1, 12,  2,  1,  1,  3, 12,  5,  9,  0,  8]),
 array([11,  1,  6,  3,  3, 11,  5,  9,  7,  9,  2,  3, 11,  3]),
 array([ 3,  8,  6,  9,  7,  6,  3,  9,  6, 12,  6, 11,  6,  1]),
 array([13, 10,  3,  4,  3,  1, 13,  0,  5,  8, 13,  6, 11,  8])]

Showing the rows for the training data of the first decision tree

In [4]:
# show the rows for the first decision tree
train.iloc[samples[0], :]
Out[4]:

price year miles doors vtype
13 1300 1997 138000 4 0
2 13000 2010 73500 4 0
12 1800 1999 163000 2 1
2 13000 2010 73500 4 0
6 3000 2004 177000 4 0
1 14000 2010 30000 2 0
3 9500 2009 78000 4 0
10 2500 2003 190000 2 1
11 5000 2001 62000 4 0
9 1900 2003 160000 4 0
6 3000 2004 177000 4 0
1 14000 2010 30000 2 0
0 22000 2012 13000 2 0
1 14000 2010 30000 2 0

We will measure the performance of the model using some test data

In [5]:
# read in and prepare the vehicle testing data
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_test.csv'
test = pd.read_csv(url)
test['vtype'] = test.vtype.map({'car':0, 'truck':1})

# define the test data points
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]

test
Out[5]:

price year miles doors vtype
0 3000 2003 130000 4 1
1 6000 2005 82500 4 0
2 12000 2010 60000 2 0

We create a deep decision tree (max_depth=None places no limit on its depth)

In [6]:
from sklearn.tree import DecisionTreeRegressor

treereg = DecisionTreeRegressor(max_depth=None, 
                                random_state=123)

We fit one tree on each of the 10 "bootstrap samples" and use each fitted tree to predict the test data, which gives 10 sets of predictions.

In [7]:
# list of predictions from each tree
predictions = []

for sample in samples:
    X_train = train.iloc[sample, 1:]
    y_train = train.iloc[sample, 0]
    
    treereg.fit(X_train, y_train)
    
    y_pred = treereg.predict(X_test)
    predictions.append(y_pred)
    
# list to NumPy array
predictions = np.array(predictions)
predictions
Out[7]:
array([[ 1300.,  5000., 14000.],
       [ 1300.,  1300., 13000.],
       [ 3000.,  3000., 13000.],
       [ 4000.,  5000., 13000.],
       [ 1300.,  5000., 13000.],
       [ 4000.,  5000., 14000.],
       [ 4000.,  4000., 13000.],
       [ 4000.,  5000., 13000.],
       [ 3000.,  5000.,  9500.],
       [ 4000.,  5000.,  9000.]])

We average the predictions of the 10 trees for each test observation

In [8]:
np.mean(predictions, axis=0)
Out[8]:
array([ 2990.,  4330., 12450.])

Calculate the RMSE

In [9]:
# calculate RMSE
from sklearn import metrics
y_pred = np.mean(predictions, axis=0)
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
Out[9]:
998.5823284370031

Now we try with B = 500, but this time using scikit-learn. We define the training and test data points

In [10]:
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]

We create a BaggingRegressor with DecisionTreeRegressor as base regressor

In [11]:
from sklearn.ensemble import BaggingRegressor
bagreg = BaggingRegressor(DecisionTreeRegressor(), 
                          n_estimators=500, 
                          bootstrap=True, 
                          oob_score=True, 
                          random_state=1)

We fit the model and predict; each prediction is the average of the predictions of the 500 trees

In [12]:
# fit and predict
bagreg.fit(X_train, y_train)
y_pred = bagreg.predict(X_test)
y_pred
Out[12]:
array([ 3344.2,  5395. , 12902. ])

Calculate the RMSE

In [13]:
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
Out[13]:
657.8000304043775

Estimating the "out-of-sample error"

For "bagged models", the "out-of-sample error" can be estimated without using a train/test split or cross-validation.
On average, each bagged tree uses about two thirds of the observations. For each tree, the observations left out of its bootstrap sample are called out-of-bag observations.
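The "about two thirds" figure can be checked numerically. The probability that a given observation appears in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (roughly 0.632) as n grows. The sketch below compares that value with a simulation; the training set size and the number of simulated samples are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000        # hypothetical training set size
n_samples = 500 # number of bootstrap samples to simulate

# Fraction of distinct observations that end up in each bootstrap sample
fractions = [
    len(np.unique(rng.choice(n, size=n, replace=True))) / n
    for _ in range(n_samples)
]
mean_in_bag = np.mean(fractions)

# Theoretical in-bag probability for a sample of size n, and its limit
theory = 1 - (1 - 1 / n) ** n
limit = 1 - np.exp(-1)

print(mean_in_bag, theory, limit)
```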

For the "bootstrap samples" that we defined before

In [14]:
samples
Out[14]:
[array([13,  2, 12,  2,  6,  1,  3, 10, 11,  9,  6,  1,  0,  1]),
 array([ 9,  0,  0,  9,  3, 13,  4,  0,  0,  4,  1,  7,  3,  2]),
 array([ 4,  7,  2,  4,  8, 13,  0,  7,  9,  3, 12, 12,  4,  6]),
 array([ 1,  5,  6, 11,  2,  1, 12,  8,  3, 10,  5,  0, 11,  2]),
 array([10, 10,  6, 13,  2,  4, 11, 11, 13, 12,  4,  6, 13,  3]),
 array([10,  0,  6,  4,  7, 11,  6,  7,  1, 11, 10,  5,  7,  9]),
 array([ 2,  4,  8,  1, 12,  2,  1,  1,  3, 12,  5,  9,  0,  8]),
 array([11,  1,  6,  3,  3, 11,  5,  9,  7,  9,  2,  3, 11,  3]),
 array([ 3,  8,  6,  9,  7,  6,  3,  9,  6, 12,  6, 11,  6,  1]),
 array([13, 10,  3,  4,  3,  1, 13,  0,  5,  8, 13,  6, 11,  8])]

The "in-bag" observations for each sample are

In [15]:
for sample in samples:
    print(set(sample))
{0, 1, 2, 3, 6, 9, 10, 11, 12, 13}
{0, 1, 2, 3, 4, 7, 9, 13}
{0, 2, 3, 4, 6, 7, 8, 9, 12, 13}
{0, 1, 2, 3, 5, 6, 8, 10, 11, 12}
{2, 3, 4, 6, 10, 11, 12, 13}
{0, 1, 4, 5, 6, 7, 9, 10, 11}
{0, 1, 2, 3, 4, 5, 8, 9, 12}
{1, 2, 3, 5, 6, 7, 9, 11}
{1, 3, 6, 7, 8, 9, 11, 12}
{0, 1, 3, 4, 5, 6, 8, 10, 11, 13}

And the "out-of-bag" observations

In [16]:
for sample in samples:
    print(sorted(set(range(14)) - set(sample)))
[4, 5, 7, 8]
[5, 6, 8, 10, 11, 12]
[1, 5, 10, 11]
[4, 7, 9, 13]
[0, 1, 5, 7, 8, 9]
[2, 3, 8, 12, 13]
[6, 7, 10, 11, 13]
[0, 4, 8, 10, 12, 13]
[0, 2, 4, 5, 10, 13]
[2, 7, 9, 12]
Then, the "out-of-bag error" is calculated by doing the following:
  1. For each observation in the training data, predict its response (label) using only the trees in which that observation is out-of-bag. Then, average the predictions of the trees.
  2. Compare all the predictions against the actual response values to calculate the out-of-bag error.
When B is large enough, the out-of-bag error is an accurate estimate of the out-of-sample error.
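The two steps above can be sketched by hand. This is an illustration on synthetic data, not the procedure scikit-learn uses internally; the data, the number of trees (B = 100), and the choice of RMSE as the error measure are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data (an assumption for illustration only)
rng = np.random.default_rng(0)
n, B = 200, 100
X = rng.uniform(0, 10, size=(n, 2))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 1, size=n)

pred_sum = np.zeros(n)    # running sum of out-of-bag predictions per observation
pred_count = np.zeros(n)  # number of trees for which each observation was out-of-bag

for b in range(B):
    in_bag = rng.choice(n, size=n, replace=True)     # bootstrap sample (row indices)
    oob = np.setdiff1d(np.arange(n), in_bag)         # rows not drawn for this tree
    if oob.size == 0:
        continue                                     # nothing to predict for this tree
    tree = DecisionTreeRegressor(random_state=b).fit(X[in_bag], y[in_bag])
    pred_sum[oob] += tree.predict(X[oob])
    pred_count[oob] += 1

# Step 1: average the predictions from the trees where each observation was out-of-bag
has_oob = pred_count > 0
oob_pred = pred_sum[has_oob] / pred_count[has_oob]

# Step 2: compare the out-of-bag predictions against the actual response values
oob_rmse = np.sqrt(np.mean((y[has_oob] - oob_pred) ** 2))
print(oob_rmse)
```

Observations that were never out-of-bag are excluded from the error; with a large B this case becomes very unlikely.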

The "out-of-bag R-squared score" (not the MSE) for B = 500

In [17]:
bagreg.oob_score_
Out[17]:
0.7986955133989982

Estimating the importance of the features

Bagging increases predictive accuracy, but decreases the interpretability of the model because it is no longer possible to visualize a single tree to understand the importance of each feature.
However, we can still get a general summary of the importance of the features from bagged models:
  • Bagged regression trees: Compute the total amount by which the MSE is reduced due to splits on a given feature, averaged over all trees.
  • Bagged classification trees: Compute the total amount by which the Gini index is reduced due to splits on a given feature, averaged over all trees.
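One way to obtain such a summary with scikit-learn is to average the feature_importances_ of the individual fitted trees, which are available through the estimators_ attribute of a BaggingRegressor. The sketch below uses synthetic data in which only the first two features affect the response; the data and the parameter values are assumptions. It also relies on the default max_features=1.0, so that every tree sees all features in the same order.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption): the third feature is pure noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 5 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 1, size=300)

bag = BaggingRegressor(DecisionTreeRegressor(),
                       n_estimators=100,
                       bootstrap=True,
                       random_state=1)
bag.fit(X, y)

# Each tree reports its normalized total error reduction per feature;
# averaging over the trees gives one importance value per feature
importances = np.mean(
    [tree.feature_importances_ for tree in bag.estimators_], axis=0
)
print(importances)
```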