Random Forests in Python


A random forest is a variation of bagged trees that usually has better performance:
  • Exactly as in bagging, we create an ensemble of decision trees using bootstrapped samples from the training set.
  • However, when each tree is built, every time a split is considered, a random sample of m features is chosen from the full set of p features as candidates for the split. Only one of those m features may be used in the split.
  • A new random sample of features is drawn at every split of every tree.
  • For classification, m is normally chosen as $\sqrt{p}$.
  • For regression, m is normally chosen as a value between $p/3$ and $p$.
What is the point of this?
  • Suppose there is one very strong feature in the dataset. With bagged trees, most trees will use that feature for the first split (the root), resulting in an ensemble of similar trees that are highly correlated.
  • Averaging highly correlated quantities does not reduce the variance much, and reducing variance is the whole point of bagging.
  • By randomly leaving features out of consideration at each split, random forests decorrelate the trees, so that averaging can reduce the variance of the resulting model (a short sketch of the corresponding scikit-learn parameter follows this list).
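As a quick illustration (not part of the original notes), this is how the m-features-per-split rule maps onto scikit-learn's max_features parameter; the values below simply follow the rules of thumb listed above.

# sketch: the "m features per split" rule as scikit-learn's max_features parameter
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

p = 11  # number of features (e.g., the 11 wine features used below)

# classification: m is usually sqrt(p)
clf = RandomForestClassifier(max_features='sqrt')

# regression: m is usually somewhere between p/3 and p (here p/3, rounded down)
reg = RandomForestRegressor(max_features=max(1, p // 3))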

Building and Tuning a Random Forest

  • Objective: Predict wine quality

Preparing the data

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(url, sep=';')
data.head()

# define X and y
X = data.iloc[:,0:11]
feature_cols = data.iloc[:,0:11].columns
y = data.quality
In [2]:
from sklearn.ensemble import RandomForestRegressor
rfreg = RandomForestRegressor()
rfreg
Out[2]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Finding the best n_estimators

A very important parameter is n_estimators, the number of trees to grow. It must be large enough for the error to stabilize.
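As an aside (a hedged sketch, not part of the original notebook, assuming X, y, plt, and RandomForestRegressor from the cells above): before running the slow cross-validated search below, the out-of-bag error can be watched as trees are added, using warm_start=True so that already-grown trees are reused; the error flattening out is the "stabilization" mentioned above.

# sketch: watch the out-of-bag RMSE stabilize as more trees are added
import numpy as np

oob_rmse = []
tree_counts = range(50, 350, 50)
rf = RandomForestRegressor(warm_start=True, oob_score=True, random_state=123)
for n in tree_counts:
    rf.set_params(n_estimators=n)   # warm_start keeps the existing trees and adds new ones
    rf.fit(X, y)
    oob_rmse.append(np.sqrt(np.mean((rf.oob_prediction_ - y) ** 2)))

plt.plot(tree_counts, oob_rmse)
plt.xlabel('n_estimators')
plt.ylabel('OOB RMSE')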
In [3]:
# evaluate each candidate n_estimators with 5-fold cross-validation
import numpy as np
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation is deprecated

# list of n_estimators values to try
estimator_range = range(10, 310, 10)

RMSE_scores = []

# 5-fold cross-validation for each value of n_estimators (WARNING: SLOW!)
for estimator in estimator_range:
    rfreg = RandomForestRegressor(n_estimators=estimator, random_state=123)
    MSE_scores = cross_val_score(rfreg, X, y, cv=5, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

Visualizing

In [4]:
# plot n_estimators (x-axis) versus RMSE (y-axis)
plt.plot(estimator_range, RMSE_scores)
plt.xlabel('n_estimators')
plt.ylabel('RMSE (lower is better)')
Out[4]:
Text(0,0.5,'RMSE (lower is better)')

The best RMSE and its corresponding n_estimators (the RMSE improves substantially; next we will tune one more parameter)

In [5]:
error, n_estimators = sorted(zip(RMSE_scores, estimator_range))[0]
(error, n_estimators)
Out[5]:
(0.6472512055000024, 160)

Finding the best max_features

The other important parameter is max_features, which is the number of features considered at each split.
In [6]:
# list of max_features values to try
feature_range = range(1, len(feature_cols)+1)

RMSE_scores = []

# 10-fold cross-validation for each value of max_features (WARNING: SLOW!)
for feature in feature_range:
    rfreg = RandomForestRegressor(n_estimators=n_estimators, max_features=feature, random_state=123)
    MSE_scores = cross_val_score(rfreg, X, y, cv=10, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

Visualizing

In [7]:
plt.plot(feature_range, RMSE_scores)
plt.xlabel('max_features')
plt.ylabel('RMSE (lower is better)')
Out[7]:
Text(0,0.5,'RMSE (lower is better)')

The best RMSE and its corresponding max_features

In [8]:
error, max_features = sorted(zip(RMSE_scores, feature_range))[0]
(error, max_features)
Out[8]:
(0.6413656894616142, 3)

Training using these parameters

In [9]:
rfreg = RandomForestRegressor(n_estimators=n_estimators, max_features=max_features, random_state=123)
rfreg.fit(X, y)
Out[9]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=3, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=160, n_jobs=1, oob_score=False, random_state=123,
           verbose=0, warm_start=False)
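As a small usage note (not in the original notes, assuming the rfreg and X just fitted and defined above), the trained model can now be used to predict; here just on the first few training rows to show the call.

# predict wine quality for the first five rows (in-sample, purely to illustrate the call)
rfreg.predict(X.head())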

Feature Importance

In [10]:
# compute feature importances
sorted_features = (pd.DataFrame({'feature': feature_cols,
                                 'importance': rfreg.feature_importances_})
                   .sort_values('importance', ascending=False))
sorted_features
Out[10]:
                 feature  importance
10               alcohol    0.192807
9              sulphates    0.131710
1       volatile acidity    0.124244
7                density    0.087652
6   total sulfur dioxide    0.083081
2            citric acid    0.075522
4              chlorides    0.072232
0          fixed acidity    0.062628
8                     pH    0.060530
3         residual sugar    0.058238
5    free sulfur dioxide    0.051357
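As an optional extra (not in the original notebook, assuming sorted_features and plt from the cells above), the same importances can also be plotted in the style used earlier.

# plot feature importances as a horizontal bar chart
sorted_features.sort_values('importance').plot(x='feature', y='importance', kind='barh', legend=False)
plt.xlabel('importance')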
In [11]:
print(sorted_features[sorted_features['importance'] > 0.1])
             feature  importance
10           alcohol    0.192807
9          sulphates    0.131710
1   volatile acidity    0.124244
Training on only the most important features does not seem to give better scores, so there must be a better way to use these importances that I don't know about (a hedged sketch of one option follows).
So far, a simple decision tree would obtain a better score. I'll go over random forest tuning again in other notes.
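One option worth trying (a hedged sketch only, not validated here; it assumes X, y, n_estimators, cross_val_score, and np from the cells above): put scikit-learn's SelectFromModel inside a pipeline so the importance-based feature selection is refit within each cross-validation fold. The 0.1 threshold is just the cutoff printed above.

# sketch: importance-based feature selection inside a cross-validated pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import make_pipeline

selector = SelectFromModel(RandomForestRegressor(n_estimators=n_estimators, random_state=123),
                           threshold=0.1)
pipe = make_pipeline(selector,
                     RandomForestRegressor(n_estimators=n_estimators, random_state=123))
MSE_scores = cross_val_score(pipe, X, y, cv=5, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-MSE_scores))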

Comparing Random Forests with Decision Trees

Advantages of Random Forests
  • Its performance competes with the best supervised learning algorithms.
  • It provides a reliable estimate of feature importance.
  • It allows one to estimate the out-of-sample error without doing a train/test split or cross-validation, via the out-of-bag samples (see the sketch after these lists).
Disadvantages
  • Less interpretable
  • Slow to train
  • Slow to predict
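A hedged sketch of the out-of-bag point from the advantages list (assuming X, y, n_estimators, and max_features from the cells above): with oob_score=True, scikit-learn evaluates each tree on the rows left out of its bootstrap sample and reports an R² estimate without any separate test set.

# out-of-bag estimate: no train/test split or cross-validation needed
rfreg_oob = RandomForestRegressor(n_estimators=n_estimators, max_features=max_features,
                                  oob_score=True, random_state=123)
rfreg_oob.fit(X, y)
rfreg_oob.oob_score_  # R^2 estimated on out-of-bag samples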
Machine learning flowchart
Source: created by the second-place finisher of Kaggle's Driver Telematics competition
Something to consider
  • A RandomForestClassifier with 500 trees and one with 1,000 trees should have roughly the same quality.
  • Decision surface: the boundary produced by a single decision tree is jagged, while the one produced by a random forest is smoother (a small sketch of this comparison follows).
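To illustrate the last point, here is a hedged sketch on synthetic 2-D data (not the wine data) that draws the decision surface of a single decision tree next to that of a random forest; the forest's boundary is typically smoother.

# sketch: decision surfaces of a single tree vs. a random forest on toy 2-D data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X2, y2 = make_moons(n_samples=300, noise=0.3, random_state=0)
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
                     np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, model in zip(axes, [DecisionTreeClassifier(random_state=0),
                            RandomForestClassifier(n_estimators=200, random_state=0)]):
    model.fit(X2, y2)
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)           # predicted class regions
    ax.scatter(X2[:, 0], X2[:, 1], c=y2, s=10)  # training points
    ax.set_title(type(model).__name__)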