Part 1: Ensembling
Ensemble learning, or ensembling, is the process of combining several predictive models to produce a combined model that is more accurate than the individual models. The predictions are combined as follows (a small numerical sketch follows the list):
- Regression: Calculate the average of the predictions.
- Classification: Use the most common prediction (a majority vote), or take the average of the predicted probabilities.
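As a quick illustration of these two combination rules, here is a minimal sketch; the arrays below are made-up predictions, not output from real models:
In [ ]:
import numpy as np
# regression: three models each predict two observations (one row per model)
reg_preds = np.array([[10.2, 8.1],
                      [9.8, 7.9],
                      [10.5, 8.4]])
print(reg_preds.mean(axis=0))                         # column-wise average

# binary classification: three models each predict labels for three observations
clf_preds = np.array([[1, 0, 1],
                      [1, 1, 1],
                      [0, 0, 1]])
print((clf_preds.mean(axis=0) >= 0.5).astype(int))    # majority vote

# binary classification: average the predicted probabilities, then threshold at 0.5
clf_probs = np.array([[0.9, 0.4, 0.7],
                      [0.6, 0.5, 0.8],
                      [0.4, 0.3, 0.9]])
print((clf_probs.mean(axis=0) >= 0.5).astype(int))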
For an ensemble to work, the models must have the following characteristics:
- Accurate models: Each model performs better than random guessing.
- Independent models: Their predictions are generated using different processes.
There are two basic ensembling methods (both are sketched below):
- Manually ensemble the individual models yourself.
- Use a model that performs the ensembling for you.
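A rough sketch of the difference between the two methods, using scikit-learn on a synthetic dataset (the data and the particular models here are illustrative only, not the vehicle example used later):
In [ ]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# method 1: manually ensemble two individual models by averaging their predictions
lr = LinearRegression().fit(X_tr, y_tr)
dt = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)
manual_pred = (lr.predict(X_te) + dt.predict(X_te)) / 2

# method 2: use a model that does the ensembling for you
# (a random forest combines many decision trees internally)
rf = RandomForestRegressor(n_estimators=100, random_state=1)
rf_pred = rf.fit(X_tr, y_tr).predict(X_te)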
Part 3: Bagging
Decision trees tend not to have the best predictive accuracy. This is partly due to high variance: different partitions of the training data can produce very different trees.
Bagging is a general-purpose procedure for reducing the variance of a machine learning method, and it is particularly useful for decision trees. Bagging stands for bootstrap aggregation, and it refers to aggregating the predictions of models trained on bootstrap samples.
A bootstrap sample is a random sample drawn with replacement.
In [1]:
import numpy as np
# set a seed for reproducibility
np.random.seed(1)
# create an array of 1 through 20
nums = np.arange(1, 21)
print(nums)
# sample that array 20 times with replacement
print(np.random.choice(a=nums, size=20, replace=True))
How does bagging work for decision trees?
- Grow B trees using B bootstrap samples of the training data.
- Train each tree on its bootstrap sample and make predictions.
- Combine the predictions.
Notes:
- Each bootstrap sample must have the same size as the original training set.
- B should be large enough that the error appears to have "stabilized".
- The trees are grown deep (they are not pruned).
In [2]:
# read in and prepare the vehicle training data
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_train.csv'
train = pd.read_csv(url)
train['vtype'] = train.vtype.map({'car':0, 'truck':1})
train
Out[2]:
In [3]:
# set a seed for reproducibility
np.random.seed(123)
# create ten bootstrap samples of the row indices (each will be used to build one tree)
samples = [np.random.choice(a=14, size=14, replace=True) for _ in range(1, 11)]
samples
Out[3]:
In [4]:
# show the rows for the first decision tree
train.iloc[samples[0], :]
Out[4]:
In [5]:
# read in and prepare the vehicle testing data
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_test.csv'
test = pd.read_csv(url)
test['vtype'] = test.vtype.map({'car':0, 'truck':1})
# define the testing data points (features and response)
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]
test
Out[5]:
In [6]:
from sklearn.tree import DecisionTreeRegressor
# an unpruned regression tree (max_depth=None lets each tree grow fully)
treereg = DecisionTreeRegressor(max_depth=None, random_state=123)
In [7]:
# list of predictions from each tree
predictions = []
for sample in samples:
    # train a tree on this bootstrap sample and predict for the testing data
    X_train = train.iloc[sample, 1:]
    y_train = train.iloc[sample, 0]
    treereg.fit(X_train, y_train)
    y_pred = treereg.predict(X_test)
    predictions.append(y_pred)
# list to NumPy array
predictions = np.array(predictions)
predictions
Out[7]:
In [8]:
# average the predictions across the trees to get the ensemble prediction
np.mean(predictions, axis=0)
Out[8]:
In [9]:
# calculate RMSE
from sklearn import metrics
y_pred = np.mean(predictions, axis=0)
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
Out[9]:
In [10]:
# define the training and testing data for use with scikit-learn's BaggingRegressor
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]
In [11]:
from sklearn.ensemble import BaggingRegressor
# bag 500 unpruned decision trees, keeping track of the out-of-bag observations
bagreg = BaggingRegressor(DecisionTreeRegressor(),
                          n_estimators=500,
                          bootstrap=True,
                          oob_score=True,
                          random_state=1)
In [12]:
# fit and predict
bagreg.fit(X_train, y_train)
y_pred = bagreg.predict(X_test)
y_pred
Out[12]:
In [13]:
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
Out[13]:
Estimating the "out-of-sample error"
For "bagged models", the "out-of-sample error" can be estimated without using train/test split or cross-validation.On average, each bagged tree uses about two thirds of the observations. For each tree, the missing observations are called out-of-bag observations.
In [14]:
samples
Out[14]:
In [15]:
# show the in-bag observations for each bootstrap sample
for sample in samples:
    print(set(sample))
In [16]:
# show the "out-of-bag" observations for each bootstrap sample
for sample in samples:
    print(sorted(set(range(14)) - set(sample)))
Then, the "out-of-bag error" is calculated by doing the following:
- For each observation in the training data, predict its response (label) using only the trees in which that observation is out-of-bag. Then, average the predictions of the trees.
- Compare all the predictions against the actual response values to calculate the out-of-bag error.
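As a rough illustration of this procedure, the sketch below reuses the ten bootstrap samples and the treereg estimator defined earlier. With only ten trees it is just an approximation, and note that bagreg.oob_score_ in the next cell reports an R-squared score rather than an RMSE.
In [ ]:
# manual sketch of the out-of-bag procedure using the ten bootstrap samples above
oob_predictions = {}
for sample in samples:
    # train a tree on this bootstrap sample
    treereg.fit(train.iloc[sample, 1:], train.iloc[sample, 0])
    # predict only for the observations that are out-of-bag for this tree
    oob_rows = sorted(set(range(14)) - set(sample))
    for row, pred in zip(oob_rows, treereg.predict(train.iloc[oob_rows, 1:])):
        oob_predictions.setdefault(row, []).append(pred)

# average the out-of-bag predictions for each observation and compute the RMSE
rows = sorted(oob_predictions)
y_true = train.iloc[rows, 0]
y_oob = [np.mean(oob_predictions[row]) for row in rows]
np.sqrt(metrics.mean_squared_error(y_true, y_oob))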
In [17]:
# compute the out-of-bag R-squared score
bagreg.oob_score_
Out[17]:
Estimating the importance of the features
Bagging increases predictive accuracy, but it decreases model interpretability because it is no longer possible to visualize a single tree to understand the importance of each feature.
However, we can still get a general summary of feature importance from bagged models (a sketch follows the list):
- Bagged regression trees: measure how much each feature reduces the MSE, averaged over all of the trees.
- Bagged classification trees: measure how much each feature reduces the Gini index, averaged over all of the trees.
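One way to obtain such a summary in scikit-learn is to average the impurity-based importances of the individual trees inside the fitted bagged model. This is a minimal sketch, assuming the default max_features=1.0 so that every tree sees all of the features:
In [ ]:
# average the feature importances of the 500 fitted trees inside the bagged model
importances = np.mean([tree.feature_importances_ for tree in bagreg.estimators_], axis=0)
pd.DataFrame({'feature': X_train.columns, 'importance': importances}).sort_values('importance', ascending=False)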