Classification And Regression Trees (CART) Explained


Here I review decision trees because they are super common in competitions.
A decision tree can be used for both regression and classification, which is why these trees are known as Classification And Regression Trees (CART).
  • The first thing I want to mention is that a decision tree is a non-linear model.

1. Decision Tree Regressor

Example

Data: Major League Baseball player data from 1986-87
  • Years (x-axis): number of years playing in the major leagues
  • Hits (y-axis): number of hits in the previous year
  • Salary (color): low salaries are blue/green, high salaries are red/yellow
Salary data
A prediction model for salary follows these steps:
  • Divide the feature space into regions where data points have something in common.
  • We want similarity within each region to be maximized, and similarity between different regions to be minimized.
  • Then, for a player whose features fall in a given region, we predict the mean salary of that region.
Considering these restrictions
  • Only use straight lines to divide regions.
  • These lines have to be vertical or horizontal.
  • A new line cannot cross an existing one (each new split stays inside one of the existing regions).
Then, a solution might look like this
Salary regions
We defined 2 cutpoints, 4.5 and 117.5, and therefore the space is divided into 3 regions.
  • $R_1$:
    • $\text{Years} < 4.5$
    • Mean salary = $165,174
  • $R_2$:
    • $\text{Years} \geq 4.5$
    • $\text{Hits} < 117.5$
    • Mean salary = $402,834
  • $R_3$:
    • $\text{Years} \geq 4.5$
    • $\text{Hits} \geq 117.5$
    • Mean salary = $845,346
In this way, these regions are used to make predictions.
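As a minimal sketch (predict_salary is a hypothetical helper of mine, and the salaries are just the region means quoted above), prediction by region is nothing more than a couple of comparisons:

def predict_salary(years, hits):
    # hypothetical helper: return the mean salary of the region the player falls into
    if years < 4.5:
        return 165174      # R1
    elif hits < 117.5:
        return 402834      # R2: Years >= 4.5, Hits < 117.5
    else:
        return 845346      # R3: Years >= 4.5, Hits >= 117.5

predict_salary(2, 100)     # a novice player falls in R1 -> 165174
predict_salary(6, 150)     # a veteran with many hits falls in R3 -> 845346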
All that forms a tree that's known as a regression tree.
Salary tree
The first leaf shows $\log(165{,}174/1000) \approx 5.11$ (the tree predicts the logarithm of salary in thousands of dollars), and the same holds for the rest of the leaves. Each leaf corresponds to a region.
Then, we can just follow the branches to make predictions.
Salary tree annotated
What does this say about the data?
  • Years are the most important factor when determining salary.
  • For a novice player, hits aren't important.

A deeper tree

Salary tree grown deep
The training error keeps decreasing as the tree gets bigger, while the test error eventually starts to increase; this is overfitting. The best depth would be around 3, where the training, cross-validation, and test errors are all close to each other.

The ideal cut point for a feature

This is another dataset, with red wine data, where the goal is to predict 'quality'.
In [1]:
import pandas as pd
url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
train = pd.read_csv(url, sep=';')
train.head()
Out[1]:

fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

Visualizing

In [2]:
train.plot(kind='scatter', x='alcohol', y='sulphates', c='quality', colormap='jet')
Out[2]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b7276d8>
In [3]:
train.describe()
Out[3]:

fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000

First, we try making predictions using only the average quality.

In [4]:
train['prediction'] = train.quality.mean()
train['prediction'].values
Out[4]:
array([5.63602251, 5.63602251, 5.63602251, ..., 5.63602251, 5.63602251,
       5.63602251])

The RMSE is calculated for predictions

In [5]:
from sklearn import metrics
import numpy as np
np.sqrt(metrics.mean_squared_error(train.quality, train.prediction))
Out[5]:
0.8073168769639513

To reduce this value, we define a function that makes predictions according to a partition on 'alcohol'

In [6]:
def alcohol_split(alcohol):
    lower_quality = train[train.alcohol < alcohol].quality.mean()
    higher_quality = train[train.alcohol >= alcohol].quality.mean()
    train['prediction'] = np.where(train.alcohol < alcohol, lower_quality, higher_quality)
    return np.sqrt(metrics.mean_squared_error(train.quality, train.prediction))
In [7]:
print('RMSE:', alcohol_split(10))
RMSE: 0.7419890561315725

The same can be done for several possible partitions

In [8]:
alcohol_range = np.linspace(train.alcohol.min(), train.alcohol.max(), 2000)
RMSE = [alcohol_split(alcohol) for alcohol in alcohol_range]

And we graph the RMSE for each possible cutpoint

In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['font.size'] = 14
In [10]:
plt.plot(alcohol_range, RMSE)
plt.xlabel('Alcohol cutpoint')
plt.ylabel('RMSE (lower is better)')
Out[10]:
Text(0,0.5,'RMSE (lower is better)')

Then for the feature alcohol, the best cut point would be around 10.5.

In summary: to make a partition, this process is performed for each feature, and then we choose the feature and cutpoint that give the smallest MSE, as sketched below.
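Here split_rmse is just my own generalization of the alcohol_split function above (it is not part of scikit-learn), and the search reuses train, np and metrics from the previous cells:

def split_rmse(feature, cutpoint):
    # same idea as alcohol_split, but for an arbitrary feature
    lower = train[train[feature] < cutpoint].quality.mean()
    higher = train[train[feature] >= cutpoint].quality.mean()
    pred = np.where(train[feature] < cutpoint, lower, higher)
    return np.sqrt(metrics.mean_squared_error(train.quality, pred))

features = [col for col in train.columns if col not in ('quality', 'prediction')]
best = min(((split_rmse(f, c), f, c)
            for f in features
            for c in np.linspace(train[f].min(), train[f].max(), 200)),
           key=lambda t: t[0])
print(best)   # (lowest RMSE, feature, cutpoint) for the first partition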

How does a computer build a regression tree?

Ideal approach: Consider all possible partitions of the feature space (computationally infeasible)
Pretty good approach: Recursive binary partitioning
  1. Starting at the root of the tree
  2. For each feature, examine all possible cut points and choose a feature and cutpoint so that the resulting tree has the lowest mean squared error (MSE). Do this partition.
  3. Examine the resulting regions, and again make a partition on one of the regions that we have, minimizing the MSE.
  4. Continue doing step 3 until a stopping criterion is met (a toy sketch of the whole procedure follows below), for example:
    • Maximum tree depth
    • Minimum number of data points on a leaf
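To make steps 2-4 concrete, here is a toy version of the recursion (my own sketch, not how scikit-learn implements it; it expects a pandas DataFrame and the name of the response column):

def best_split(df, target):
    # find the single (feature, cutpoint) split with the lowest total squared error
    best = None
    for feature in df.columns.drop(target):
        for cutpoint in sorted(df[feature].unique()):
            left, right = df[df[feature] < cutpoint], df[df[feature] >= cutpoint]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = (((left[target] - left[target].mean()) ** 2).sum() +
                   ((right[target] - right[target].mean()) ** 2).sum())
            if best is None or sse < best[2]:
                best = (feature, cutpoint, sse)
    return best

def build_tree(df, target, depth=0, max_depth=3, min_samples=20):
    split = best_split(df, target)
    # stopping criteria: maximum depth, too few points in the node, or no valid split left
    if depth == max_depth or len(df) <= min_samples or split is None:
        return {'prediction': df[target].mean()}   # leaf: predict the mean response
    feature, cutpoint, _ = split
    return {'feature': feature, 'cutpoint': cutpoint,
            'left': build_tree(df[df[feature] < cutpoint], target, depth + 1, max_depth, min_samples),
            'right': build_tree(df[df[feature] >= cutpoint], target, depth + 1, max_depth, min_samples)}

For example, build_tree(train.drop('prediction', axis=1).sample(200, random_state=1), 'quality', max_depth=2) builds a small tree on a sample of the wine data; it is much slower than scikit-learn, which performs the same greedy search in optimized code.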

Building a regression tree using scikit-learn

In [11]:
X = train.iloc[:,0:11]
feature_cols = train.iloc[:,0:11].columns
y = train.quality
In [12]:
from sklearn.tree import DecisionTreeRegressor
treereg = DecisionTreeRegressor(splitter='best', random_state=1)
treereg
Out[12]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

Cross-validation

In [13]:
# use leave-one-out cross-validation (LOOCV)
from sklearn.model_selection import cross_val_score

# cv=len(X) so that this is leave-one-out cross-validation (LOOCV)
scores = cross_val_score(treereg, X, y, cv=len(X), scoring='neg_mean_squared_error')

np.mean(np.sqrt(-scores))
Out[13]:
0.44152595372107567

Tuning

Let's try to reduce the RMSE by changing the parameter max_depth.
In [14]:
# list of values to try
max_depth_range = range(1, 8)

# list to store the RMSE for each value of max_depth
RMSE_scores = []

# use 14-fold cross-validation with each value of max_depth
for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    MSE_scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

Graph the RMSE for each value of max_depth

In [15]:
# plot max_depth (x-axis) versus RMSE (y-axis)
plt.plot(max_depth_range, RMSE_scores)
plt.xlabel('max_depth')
plt.ylabel('RMSE (lower is better)')
Out[15]:
Text(0,0.5,'RMSE (lower is better)')

The best value of the parameter is 3

In [16]:
# max_depth=3 was best, so fit a tree using that parameter
treereg = DecisionTreeRegressor(max_depth=3, random_state=1)
treereg.fit(X, y)
Out[16]:
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

We can measure the importance of the features after training

The Gini importance of each feature is the total reduction of the error produced by the splits on that feature.
In [17]:
pd.DataFrame({'feature':feature_cols, 'importance':treereg.feature_importances_})
Out[17]:

feature importance
0 fixed acidity 0.000000
1 volatile acidity 0.160126
2 citric acid 0.000000
3 residual sugar 0.000000
4 chlorides 0.000000
5 free sulfur dioxide 0.000000
6 total sulfur dioxide 0.000000
7 density 0.000000
8 pH 0.000000
9 sulphates 0.244731
10 alcohol 0.595143

Visualizing the tree

In [18]:
# create a Graphviz file
from sklearn.tree import export_graphviz
export_graphviz(treereg, out_file='images/tree_wine.dot', feature_names=feature_cols)

# At the command line, run this to convert it to PNG:
#   dot -Tpng images/tree_wine.dot -o images/tree_wine.png
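Alternatively (assuming the optional graphviz Python package is installed; this notebook does not otherwise use it), the file can be displayed directly in the notebook:

# Optional: display the .dot file inline (requires the `graphviz` Python package)
from graphviz import Source
Source.from_file('images/tree_wine.dot')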
Tree for wine data
Reading the internal nodes:
  • samples: number of observations in the node before splitting
  • mse: MSE calculated by comparing the actual response values in the node against the mean response in the node
  • rule: rule used to split the node (to the left if it is true and to the right if it is not)
Reading the leaves:
  • samples: number of observations in the node
  • value: average response in the node
  • mse: MSE calculated by comparing the actual response values in the node against the "value" shown in the node
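In newer scikit-learn versions (0.21+, newer than the one that produced the outputs above), the same structure can also be dumped as plain text, which avoids Graphviz entirely:

# Requires scikit-learn >= 0.21
from sklearn.tree import export_text
print(export_text(treereg, feature_names=list(feature_cols)))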

2. Decision Tree Classifier

Regression trees vs. classification trees:
  • Regression trees predict a continuous response; classification trees predict a categorical response.
  • Regression trees predict using the average response on the leaf; classification trees predict using the most common class on the leaf.
  • In regression trees the partitions are chosen so that they minimize the MSE; in classification trees they are chosen so that they minimize the "Gini index".
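As a minimal sketch on the same wine data (turning quality into a binary "good" class with the threshold quality >= 7 is my own arbitrary choice, just to have a categorical response):

from sklearn.tree import DecisionTreeClassifier

# Binarize quality only for illustration: 1 = "good" wine (quality >= 7), 0 = the rest
y_class = (train.quality >= 7).astype(int)

treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(X, y_class)

# Each prediction is the most common class on the corresponding leaf
treeclf.predict(X.head())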

Partitioning criteria

Some common partitioning criteria:
  • Classification error rate: fraction of training data points in the region that do not belong to the most common class.
  • Gini index: total variance of the classes in a region; it is small when the region is dominated by a single class.
That Gini index belongs to a larger topic, so I may go over it later.
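Still, as a tiny numeric sketch of both criteria for a single region (the class proportions below are made up):

import numpy as np

def classification_error(p):
    # fraction of points that do not belong to the most common class
    return 1 - np.max(p)

def gini_index(p):
    # sum over classes of p_k * (1 - p_k), equivalently 1 - sum(p_k^2)
    p = np.asarray(p)
    return 1 - np.sum(p ** 2)

p = [0.7, 0.2, 0.1]          # made-up class proportions in one region
classification_error(p)       # 1 - 0.7 = 0.3
gini_index(p)                 # 1 - (0.49 + 0.04 + 0.01) = 0.46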