A random forest is a variation of bagged trees that usually has better performance:
Source: Machine learning flowchart created by the second place finisher of Kaggle's Driver Telematics competition
- Exactly as in bagging, we create an ensemble of decision trees using bootstrapped samples from the training set.
- However, when each tree is built, every time a split is considered, a random sample of m features is chosen as split candidates from the full set of p features. Only one of those m candidate features is actually used in the split.
- A new random sample of features is chosen for each split in each tree.
- For classification, m is typically chosen as $\sqrt{p}$.
- For regression, m is typically chosen between $p/3$ and $p$ (a short scikit-learn sketch of this choice follows below).
What is the point of this?
- Suppose there is one very strong feature in the dataset. With bagged trees, most trees will use that feature for the first split (the root), resulting in an ensemble of similar, highly correlated trees.
- Averaging highly correlated quantities does not significantly reduce the variance (which is the whole point of bagging).
- By randomly setting features aside at each split, random forests decorrelate the trees, so that averaging can actually reduce the variance of the resulting model.
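As a rough sketch of how this choice of m maps onto scikit-learn: the number of candidate features per split is controlled by the max_features parameter. The values below are just the rules of thumb from the list above, not tuned settings.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# classification: consider roughly sqrt(p) candidate features at each split
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)

# regression: consider a fraction of the p features at each split (p/3 here)
reg = RandomForestRegressor(n_estimators=100, max_features=1/3, random_state=0)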
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(url, sep=';')
data.head()
# define X and y
X = data.iloc[:,0:11]
feature_cols = data.iloc[:,0:11].columns
y = data.quality
In [2]:
from sklearn.ensemble import RandomForestRegressor
rfreg = RandomForestRegressor()
rfreg
Out[2]:
In [3]:
# tune n_estimators with cross-validation
import numpy as np
from sklearn.model_selection import cross_val_score
# list of n_estimators values to try
estimator_range = range(10, 310, 10)
RMSE_scores = []
# 5-fold cross-validation with each value of n_estimators (WARNING: SLOW!)
for estimator in estimator_range:
    rfreg = RandomForestRegressor(n_estimators=estimator, random_state=123)
    MSE_scores = cross_val_score(rfreg, X, y, cv=5, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))
In [4]:
# plot n_estimators (x-axis) versus RMSE (y-axis)
plt.plot(estimator_range, RMSE_scores)
plt.xlabel('n_estimators')
plt.ylabel('RMSE (lower is better)')
Out[4]:
In [5]:
error, n_estimators = sorted(zip(RMSE_scores, estimator_range))[0]
(error, n_estimators)
Out[5]:
In [6]:
# list of max_features values to try
feature_range = range(1, len(feature_cols)+1)
RMSE_scores = []
# 10-fold cross-validation with each value of max_features (WARNING: SLOW!)
for feature in feature_range:
    rfreg = RandomForestRegressor(n_estimators=n_estimators, max_features=feature, random_state=123)
    MSE_scores = cross_val_score(rfreg, X, y, cv=10, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))
In [7]:
plt.plot(feature_range, RMSE_scores)
plt.xlabel('max_features')
plt.ylabel('RMSE (lower is better)')
Out[7]:
In [8]:
error, max_features = sorted(zip(RMSE_scores, feature_range))[0]
(error, max_features)
Out[8]:
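As an aside, the two-step search above (first n_estimators, then max_features) could also be done jointly. A minimal sketch using scikit-learn's GridSearchCV, reusing X and y from above; the parameter grid is only illustrative, not the values searched in this notebook, and it is even slower than the loops above.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
import numpy as np

# search both hyperparameters at once with 5-fold CV (WARNING: SLOW!)
param_grid = {'n_estimators': [100, 200, 300],
              'max_features': [2, 4, 6, 8]}
grid = GridSearchCV(RandomForestRegressor(random_state=123), param_grid,
                    cv=5, scoring='neg_mean_squared_error')
grid.fit(X, y)
print(grid.best_params_)
print(np.sqrt(-grid.best_score_))  # approximate RMSE of the best combination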
In [9]:
rfreg = RandomForestRegressor(n_estimators=n_estimators, max_features=max_features, random_state=123)
rfreg.fit(X, y)
Out[9]:
In [10]:
# compute feature importances
sorted_features = pd.DataFrame({'feature':feature_cols, 'importance':rfreg.feature_importances_}).sort_values('importance', ascending=False)
sorted_features
Out[10]:
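A quick sketch to visualize the same importances as a horizontal bar chart (reusing sorted_features from the cell above):
# horizontal bar chart of the feature importances computed above
sorted_features.sort_values('importance').plot(x='feature', y='importance',
                                               kind='barh', legend=False)
plt.xlabel('importance')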
In [11]:
print(sorted_features[sorted_features['importance'] > 0.1])
Training on only the most important features does not seem to give better scores, so there must be a better way to use these importances that I don't know about yet.
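A minimal sketch of that experiment (not the exact cell that was run): keep only the features whose importance exceeds 0.1, the same threshold printed above, and cross-validate on that subset.
from sklearn.model_selection import cross_val_score

# keep only the columns whose importance is above 0.1
top_features = sorted_features.loc[sorted_features['importance'] > 0.1, 'feature']
X_top = X[top_features]

# same settings as the final model; cap max_features in case few columns remain
rfreg = RandomForestRegressor(n_estimators=n_estimators,
                              max_features=min(max_features, X_top.shape[1]),
                              random_state=123)
MSE_scores = cross_val_score(rfreg, X_top, y, cv=10, scoring='neg_mean_squared_error')
print(np.mean(np.sqrt(-MSE_scores)))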
So far, a simple decision tree would obtain a better score. I'll go over random forest tuning again in other notes.
Comparing Random Forests with Decision Trees
Advantages of Random Forests
- Their performance competes with the best supervised learning algorithms.
- They provide a reliable estimate of feature importance.
- They allow one to estimate the out-of-sample error without doing a train/test split or cross-validation, via the out-of-bag (OOB) score (a short sketch follows below).
Disadvantages of Random Forests
- Less interpretable than a single decision tree.
- Slow to train.
- Slow to predict.
Source: Machine learning flowchart created by the second place finisher of Kaggle's Driver Telematics competition
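A minimal sketch of the out-of-bag estimate mentioned in the advantages above, reusing X, y, and the tuned parameters from this notebook; note that oob_score_ is an R² value, not an RMSE.
# out-of-bag (OOB) estimate: each tree is scored on the samples it did not see
rfreg = RandomForestRegressor(n_estimators=n_estimators, max_features=max_features,
                              oob_score=True, random_state=123)
rfreg.fit(X, y)
print(rfreg.oob_score_)  # R^2 computed from the out-of-bag predictions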
Something to consider
- A RandomForestClassifier with 500 trees and one with 1,000 trees should have roughly the same quality; beyond a certain point, adding trees mainly adds computation time (a quick check on this dataset follows below).
- Decision surface: the one on the left is generated by a single decision tree, and the one on the right by a random forest, whose boundaries are smoother.
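A quick, slow-to-run check of the first point on this dataset, using the regressor from this notebook rather than a classifier: the cross-validated RMSE for 500 and for 1,000 trees should come out nearly identical.
# compare cross-validated RMSE for 500 vs 1000 trees (WARNING: SLOW!)
for n in [500, 1000]:
    rfreg = RandomForestRegressor(n_estimators=n, random_state=123)
    MSE_scores = cross_val_score(rfreg, X, y, cv=5, scoring='neg_mean_squared_error')
    print(n, np.mean(np.sqrt(-MSE_scores)))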