Part 1: Ensembling

Ensemble learning or ensembling is the process of combining several predictive models to produce a combined model that is more accurate.

Regression: Calculate the average of the predictions.
Classification: Use the most common prediction or take the average of the predicted probabilities.
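As a minimal sketch of these combination rules, the snippet below applies them to small hand-made arrays. All the numbers are hypothetical and serve only to illustrate the arithmetic; they do not come from any real model.

```python
import numpy as np

# Hypothetical predictions from three regression models for four observations
reg_preds = np.array([
    [10.0, 20.0, 30.0, 40.0],
    [12.0, 18.0, 33.0, 41.0],
    [11.0, 22.0, 27.0, 39.0],
])
# Regression: average the predictions of the models for each observation
reg_ensemble = reg_preds.mean(axis=0)

# Hypothetical class labels predicted by three classifiers
class_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
])
# Classification, option 1: take the most common label for each observation
vote_ensemble = np.array([np.bincount(col).argmax() for col in class_preds.T])

# Hypothetical predicted probabilities of class 1 from the same three classifiers
class_probs = np.array([
    [0.2, 0.9, 0.6, 0.1],
    [0.4, 0.8, 0.4, 0.3],
    [0.6, 0.7, 0.7, 0.2],
])
# Classification, option 2: average the probabilities, then apply a 0.5 threshold
prob_ensemble = (class_probs.mean(axis=0) >= 0.5).astype(int)

print(reg_ensemble)
print(vote_ensemble)
print(prob_ensemble)
```

Averaging probabilities keeps information about how confident each model is, which a simple vote throws away.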

For an ensemble to work well, the individual models should have the following characteristics:

Accurate models: Each one predicts better than a naive baseline.
Independent models: Their predictions are generated using different processes.

General idea: If we have a collection of independent models, the errors that one model makes are unlikely to be made by the others. As a result, the errors tend to cancel out when the predictions are averaged.
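The simulation below is a rough illustration of this idea, not a proof. It assumes that each model's prediction equals the true value plus its own independent random error, which is an idealized setting that real models only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_models = 1000, 5
truth = rng.normal(size=n_obs)

# Assumption: each simulated model is the truth plus independent noise
preds = truth + rng.normal(scale=1.0, size=(n_models, n_obs))

def rmse(y_true, y_hat):
    """Root mean squared error between two arrays."""
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

# Error of each model on its own, and error of the averaged prediction
individual_rmse = np.array([rmse(truth, p) for p in preds])
ensemble_rmse = rmse(truth, preds.mean(axis=0))

print(individual_rmse.round(2))
print(round(ensemble_rmse, 2))
```

In this setting the averaged prediction has a clearly lower RMSE than any single model. The benefit shrinks when the models' errors are correlated.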
There are two basic ensembling methods:

Manually join the individual models
Use a model that

Part 2: Manual Ensembling

What makes a good manual ensemble?

Different types of models
Different combinations of features
Different values of the tuning parameters
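A manual ensemble along these lines might look like the sketch below. The synthetic data, the choice of three model types and their parameter values are all assumptions made for illustration; the source does not prescribe them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative assumption)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Three different types of models, each with its own parameter values
models = [
    LinearRegression(),
    DecisionTreeRegressor(max_depth=4, random_state=0),
    KNeighborsRegressor(n_neighbors=5),
]

# Fit each model separately and collect its test predictions
all_preds = np.array([m.fit(X_train, y_train).predict(X_test) for m in models])

# Manual ensemble: average the predictions of the individual models
ensemble_pred = all_preds.mean(axis=0)

for model, pred in zip(models, all_preds):
    print(type(model).__name__, np.sqrt(np.mean((y_test - pred) ** 2)))
print("ensemble", np.sqrt(np.mean((y_test - ensemble_pred) ** 2)))
```

Whether the ensemble beats its best member depends on the data, so it is worth comparing the printed errors rather than assuming an improvement.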

Source: Machine learning flowchart created by the winner of Kaggle's CrowdFlower competition

Part 3: Bagging

Decision trees tend not to have the best predictive accuracy. This is partly due to high variance: different splits of the training data can produce very different trees.
Bagging is a general-purpose procedure for reducing the variance of a machine learning method, and it is particularly useful for decision trees. Bagging is short for bootstrap aggregation, that is, aggregating the predictions of models trained on bootstrap samples.

A bootstrap sample is a random sample drawn with replacement.

In [1]:

import numpy as np
# set a seed for reproducibility
np.random.seed(1)

# create an array of 1 through 20
nums = np.arange(1, 21)
print(nums)

# sample that array 20 times with replacement
print(np.random.choice(a=nums, size=20, replace=True))

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
[ 6 12 13  9 10 12  6 16  1 17  2 13  8 14  7 19  6 19 12 11]

How does bagging work for decision trees?

Grow B trees using B bootstrap samples of the training data.
Train each tree on its bootstrap sample and make predictions.
Combine the predictions (average them for regression, take a vote for classification).

Notes:

Each bootstrap sample should be the same size as the original training set.
B should be large enough that the error appears to have "stabilized".
The trees are grown deep, so each one has low bias and high variance.

Bagging increases predictive accuracy by reducing variance, much as cross-validation reduces the variance associated with a single train/test split by splitting several times and averaging the results.
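A standard textbook result, not stated in the source, makes the variance reduction concrete. Assume that each of the B tree predictions has variance \sigma^2 and that every pair of trees has correlation \rho (both symbols are introduced here for illustration). The variance of their average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

As B grows, the second term vanishes, so the remaining variance is governed by how correlated the trees are. With fully independent trees (\rho = 0) the variance falls as \sigma^2 / B.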

Manually implementing "bagged decision trees" (B = 10)

In [2]:

# read in and prepare the vehicle training data
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_train.csv'
train = pd.read_csv(url)
train['vtype'] = train.vtype.map({'car':0, 'truck':1})
train

Out[2]:

	price	year	miles	doors	vtype
0	22000	2012	13000	2	0
1	14000	2010	30000	2	0
2	13000	2010	73500	4	0
3	9500	2009	78000	4	0
4	9000	2007	47000	4	0
5	4000	2006	124000	2	0
6	3000	2004	177000	4	0
7	2000	2004	209000	4	1
8	3000	2003	138000	2	0
9	1900	2003	160000	4	0
10	2500	2003	190000	2	1
11	5000	2001	62000	4	0
12	1800	1999	163000	2	1
13	1300	1997	138000	4	0

Creating 10 "bootstrap samples" that will be used to select rows of the DataFrame

In [3]:

# set a seed for reproducibility
np.random.seed(123)

samples = [np.random.choice(a=14, size=14, replace=True) for _ in range(1, 11)]
samples

Out[3]:

[array([13,  2, 12,  2,  6,  1,  3, 10, 11,  9,  6,  1,  0,  1]),
 array([ 9,  0,  0,  9,  3, 13,  4,  0,  0,  4,  1,  7,  3,  2]),
 array([ 4,  7,  2,  4,  8, 13,  0,  7,  9,  3, 12, 12,  4,  6]),
 array([ 1,  5,  6, 11,  2,  1, 12,  8,  3, 10,  5,  0, 11,  2]),
 array([10, 10,  6, 13,  2,  4, 11, 11, 13, 12,  4,  6, 13,  3]),
 array([10,  0,  6,  4,  7, 11,  6,  7,  1, 11, 10,  5,  7,  9]),
 array([ 2,  4,  8,  1, 12,  2,  1,  1,  3, 12,  5,  9,  0,  8]),
 array([11,  1,  6,  3,  3, 11,  5,  9,  7,  9,  2,  3, 11,  3]),
 array([ 3,  8,  6,  9,  7,  6,  3,  9,  6, 12,  6, 11,  6,  1]),
 array([13, 10,  3,  4,  3,  1, 13,  0,  5,  8, 13,  6, 11,  8])]

Showing the rows for the training data of the first decision tree

In [4]:

# show the rows for the first decision tree
train.iloc[samples[0], :]

Out[4]:

	price	year	miles	doors	vtype
13	1300	1997	138000	4	0
2	13000	2010	73500	4	0
12	1800	1999	163000	2	1
2	13000	2010	73500	4	0
6	3000	2004	177000	4	0
1	14000	2010	30000	2	0
3	9500	2009	78000	4	0
10	2500	2003	190000	2	1
11	5000	2001	62000	4	0
9	1900	2003	160000	4	0
6	3000	2004	177000	4	0
1	14000	2010	30000	2	0
0	22000	2012	13000	2	0
1	14000	2010	30000	2	0

We will measure the performance of the model using some test data

In [5]:

# read in and prepare the vehicle testing data
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_test.csv'
test = pd.read_csv(url)
test['vtype'] = test.vtype.map({'car':0, 'truck':1})

# define the test data points
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]

test

Out[5]:

	price	year	miles	doors	vtype
0	3000	2003	130000	4	1
1	6000	2005	82500	4	0
2	12000	2010	60000	2	0

We create a deep decision tree (max_depth=None places no limit on its depth)

In [6]:

from sklearn.tree import DecisionTreeRegressor

treereg = DecisionTreeRegressor(max_depth=None, 
                                random_state=123)

We fit a tree to each of the "bootstrap samples" and use it to predict the test data. This gives 10 sets of predictions, one per tree.

In [7]:

# list of predictions from each tree
predictions = []

for sample in samples:
    X_train = train.iloc[sample, 1:]
    y_train = train.iloc[sample, 0]
    
    treereg.fit(X_train, y_train)
    
    y_pred = treereg.predict(X_test)
    predictions.append(y_pred)
    
# list to NumPy array
predictions = np.array(predictions)
predictions

Out[7]:

array([[ 1300.,  5000., 14000.],
       [ 1300.,  1300., 13000.],
       [ 3000.,  3000., 13000.],
       [ 4000.,  5000., 13000.],
       [ 1300.,  5000., 13000.],
       [ 4000.,  5000., 14000.],
       [ 4000.,  4000., 13000.],
       [ 4000.,  5000., 13000.],
       [ 3000.,  5000.,  9500.],
       [ 4000.,  5000.,  9000.]])

We average the predictions for each test observation

In [8]:

np.mean(predictions, axis=0)

Out[8]:

array([ 2990.,  4330., 12450.])

Calculate the RMSE

In [9]:

# calculate RMSE
from sklearn import metrics
y_pred = np.mean(predictions, axis=0)
np.sqrt(metrics.mean_squared_error(y_test, y_pred))

Out[9]:

998.5823284370031

Now we try with B = 500, but this time using scikit-learn. We define the training and test data points

In [10]:

X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]

We create a BaggingRegressor with DecisionTreeRegressor as base regressor

In [11]:

from sklearn.ensemble import BaggingRegressor
bagreg = BaggingRegressor(DecisionTreeRegressor(), 
                          n_estimators=500, 
                          bootstrap=True, 
                          oob_score=True, 
                          random_state=1)

We fit the model and predict on the test data; each prediction is the average over the 500 trees

In [12]:

# fit and predict
bagreg.fit(X_train, y_train)
y_pred = bagreg.predict(X_test)
y_pred

Out[12]:

array([ 3344.2,  5395. , 12902. ])

Calculate the RMSE

In [13]:

np.sqrt(metrics.mean_squared_error(y_test, y_pred))

Out[13]:

657.8000304043775

Estimating the "out-of-sample error"

For "bagged models", the "out-of-sample error" can be estimated without using a train/test split or cross-validation.
On average, each bagged tree uses about two thirds of the observations. For each tree, the missing observations are called out-of-bag observations.
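The "two thirds" figure can be checked with a short calculation. The chance that a given observation is never picked in n draws with replacement is (1 - 1/n)^n, so the expected in-bag fraction is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) as n grows. The sketch below compares that formula with a simulation; the function name and the number of repetitions are chosen here for illustration.

```python
import numpy as np

def expected_in_bag_fraction(n):
    """Probability that a given observation appears at least once
    in a bootstrap sample of size n drawn from n observations."""
    return 1 - (1 - 1 / n) ** n

rng = np.random.default_rng(0)
n = 14  # same size as the vehicle training set used in this post

# Empirical check: draw many bootstrap samples and count the unique rows
fractions = [len(np.unique(rng.integers(0, n, size=n))) / n
             for _ in range(10_000)]

print(expected_in_bag_fraction(n))   # theoretical value for n = 14
print(np.mean(fractions))            # simulated average
print(1 - np.exp(-1))                # limit for large n
```

For a small training set such as n = 14 the in-bag fraction is slightly above the large-sample limit, but "about two thirds" remains a fair summary.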

For the "bootstrap samples" that we defined before

In [14]:

samples

Out[14]:

[array([13,  2, 12,  2,  6,  1,  3, 10, 11,  9,  6,  1,  0,  1]),
 array([ 9,  0,  0,  9,  3, 13,  4,  0,  0,  4,  1,  7,  3,  2]),
 array([ 4,  7,  2,  4,  8, 13,  0,  7,  9,  3, 12, 12,  4,  6]),
 array([ 1,  5,  6, 11,  2,  1, 12,  8,  3, 10,  5,  0, 11,  2]),
 array([10, 10,  6, 13,  2,  4, 11, 11, 13, 12,  4,  6, 13,  3]),
 array([10,  0,  6,  4,  7, 11,  6,  7,  1, 11, 10,  5,  7,  9]),
 array([ 2,  4,  8,  1, 12,  2,  1,  1,  3, 12,  5,  9,  0,  8]),
 array([11,  1,  6,  3,  3, 11,  5,  9,  7,  9,  2,  3, 11,  3]),
 array([ 3,  8,  6,  9,  7,  6,  3,  9,  6, 12,  6, 11,  6,  1]),
 array([13, 10,  3,  4,  3,  1, 13,  0,  5,  8, 13,  6, 11,  8])]

The "in-bag" observations for each sample are

In [15]:

for sample in samples:
    print(set(sample))

{0, 1, 2, 3, 6, 9, 10, 11, 12, 13}
{0, 1, 2, 3, 4, 7, 9, 13}
{0, 2, 3, 4, 6, 7, 8, 9, 12, 13}
{0, 1, 2, 3, 5, 6, 8, 10, 11, 12}
{2, 3, 4, 6, 10, 11, 12, 13}
{0, 1, 4, 5, 6, 7, 9, 10, 11}
{0, 1, 2, 3, 4, 5, 8, 9, 12}
{1, 2, 3, 5, 6, 7, 9, 11}
{1, 3, 6, 7, 8, 9, 11, 12}
{0, 1, 3, 4, 5, 6, 8, 10, 11, 13}

And the "out-of-bag" observations

In [16]:

for sample in samples:
    print(sorted(set(range(14)) - set(sample)))

[4, 5, 7, 8]
[5, 6, 8, 10, 11, 12]
[1, 5, 10, 11]
[4, 7, 9, 13]
[0, 1, 5, 7, 8, 9]
[2, 3, 8, 12, 13]
[6, 7, 10, 11, 13]
[0, 4, 8, 10, 12, 13]
[0, 2, 4, 5, 10, 13]
[2, 7, 9, 12]

Then, the "out-of-bag error" is calculated by doing the following:

For each observation in the training data, predict its response (label) using only the trees in which that observation is out-of-bag. Then, average the predictions of the trees.
Compare all the predictions against the actual response values to calculate the out-of-bag error.
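The two steps above can be sketched by hand as follows. This is an illustrative implementation on synthetic data, not the internals of scikit-learn; the data, B = 50 and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative assumption)
rng = np.random.default_rng(0)
n, B = 60, 50
X = rng.uniform(0, 10, size=(n, 2))
y = 2 * X[:, 0] + rng.normal(scale=1.0, size=n)

pred_sum = np.zeros(n)    # running sum of out-of-bag predictions per observation
pred_count = np.zeros(n)  # number of trees for which each observation was out-of-bag

for _ in range(B):
    in_bag = rng.integers(0, n, size=n)        # bootstrap sample of row indices
    oob = np.setdiff1d(np.arange(n), in_bag)   # rows this tree never saw
    if oob.size == 0:
        continue                               # nothing to predict for this tree
    tree = DecisionTreeRegressor(random_state=0).fit(X[in_bag], y[in_bag])
    pred_sum[oob] += tree.predict(X[oob])
    pred_count[oob] += 1

# Step 1: average the out-of-bag predictions for each observation
has_oob = pred_count > 0
oob_pred = pred_sum[has_oob] / pred_count[has_oob]

# Step 2: compare them with the actual responses
oob_rmse = np.sqrt(np.mean((y[has_oob] - oob_pred) ** 2))
print(oob_rmse)
```

Each observation is predicted only by trees that did not train on it, which is what lets the result act as an estimate of out-of-sample error.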

When B is large enough, the out-of-bag error is an accurate estimate of the out-of-sample error.

The "out-of-bag R-squared score" (not the MSE) for B = 500

In [17]:

bagreg.oob_score_

Out[17]:

0.7986955133989982

Estimating the importance of the features

Bagging increases predictive accuracy but decreases the interpretability of the model, because there is no longer a single tree to visualize in order to understand the importance of each feature.
However, we can still get an overall summary of feature importance from bagged models:

Bagged regression trees: for each feature, total the reduction in MSE from splits on that feature, averaged over all trees.
Bagged classification trees: for each feature, total the reduction in the Gini index from splits on that feature, averaged over all trees.
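One way to obtain such a summary with scikit-learn is to average the feature_importances_ attribute of the individual fitted trees, which are available through the estimators_ attribute of the bagging model. The sketch below does this on synthetic data in which only the first feature drives the response; the data and parameter values are assumptions for illustration, and it relies on the default setting in which every tree sees all features.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: only the first column affects y, the other two are noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 5 * X[:, 0] + rng.normal(scale=1.0, size=200)

bag = BaggingRegressor(DecisionTreeRegressor(),
                       n_estimators=100,
                       random_state=1)
bag.fit(X, y)

# Average the per-tree importances (normalized impurity reduction per feature)
importances = np.mean(
    [tree.feature_importances_ for tree in bag.estimators_], axis=0
)
print(importances)
```

The averaged importances sum to 1, and here the first feature should receive almost all of the weight.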
