Here I review decision trees because they are super common in competitions.
A decision tree can be used for both regression and classification, which is why these trees are known as Classification And Regression Trees (CART).
A model to predict salary follows these steps:
- Divide the feature space into regions where data points have something in common.
- Similarity within each region is maximized, and similarity between different regions is minimized.
- To predict, find the region that matches a player's features and return the mean salary of that region.
- Only straight lines are used to divide regions.
- These lines have to be vertical or horizontal.
- The dividing lines cannot cross each other.
Then, a solution might look like this. We define two cutpoints, 4.5 on Years and 117.5 on Hits, so the space is divided into 3 regions:
- $R_1$:
- $\text{Years} < 4.5$
- Mean salary = $165,174
- $R_2$:
- $\text{Years} \geq 4.5$
- $\text{Hits} < 117.5$
- Mean salary = $402,834
- $R_3$:
- $\text{Years} \geq 4.5$
- $\text{Hits} \geq 117.5$
- Mean salary = $845,346
All that forms a tree that's known as a regression tree.
The tree shows $\log(165174/1000) \approx 5.11$ at the first leaf (the mean salary in thousands of dollars, log-transformed), and similarly for the rest of the leaves. Each leaf corresponds to a region.
Then, we can just follow the branches to make predictions.
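As a minimal sketch of how these predictions work (the cutpoints and mean salaries are the ones listed above; the function name is just illustrative):

```python
import numpy as np

def predict_salary(years, hits):
    """Follow the branches of the hand-built regression tree above."""
    if years < 4.5:
        return 165174   # R1: mean salary for players with few years
    if hits < 117.5:
        return 402834   # R2: experienced players with few hits
    return 845346       # R3: experienced players with many hits

print(predict_salary(3, 100))                 # 165174 (region R1)
# the plotted leaves show log(salary / 1000):
print(np.log(predict_salary(3, 100) / 1000))  # ≈ 5.11
```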
What does this say about the data?
- Years are the most important factor when determining salary.
- For a novice player, hits aren't important.
The training error keeps decreasing as the tree gets bigger, while the test error eventually starts to increase: this is overfitting. The best depth would be 3, where the training, cross-validation, and test errors are close to each other.
In [1]:
import pandas as pd
url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
train = pd.read_csv(url, sep=';')
train.head()
Out[1]:
In [2]:
train.plot(kind='scatter', x='alcohol', y='sulphates', c='quality', colormap='jet')
Out[2]:
In [3]:
train.describe()
Out[3]:
In [4]:
train['prediction'] = train.quality.mean()
train['prediction'].values
Out[4]:
In [5]:
from sklearn import metrics
import numpy as np
np.sqrt(metrics.mean_squared_error(train.quality, train.prediction))
Out[5]:
In [6]:
def alcohol_split(alcohol):
    """Split on the given alcohol cutpoint, predict the mean quality
    of each side, and return the resulting RMSE."""
    lower_quality = train[train.alcohol < alcohol].quality.mean()
    higher_quality = train[train.alcohol >= alcohol].quality.mean()
    train['prediction'] = np.where(train.alcohol < alcohol, lower_quality, higher_quality)
    return np.sqrt(metrics.mean_squared_error(train.quality, train.prediction))
In [7]:
print('RMSE:', alcohol_split(10))
In [8]:
alcohol_range = np.linspace(train.alcohol.min(), train.alcohol.max(), 2000)
RMSE = [alcohol_split(alcohol) for alcohol in alcohol_range]
In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['font.size'] = 14
In [10]:
plt.plot(alcohol_range, RMSE)
plt.xlabel('Alcohol cutpoint')
plt.ylabel('RMSE (lower is better)')
Out[10]:
How does a computer build a regression tree?
Ideal approach: consider all possible partitions of the feature space (computationally infeasible).

Pretty good approach: recursive binary partitioning (a greedy sketch of a single split is shown after this list).

1. Start at the root of the tree.
2. For each feature, examine all possible cutpoints, and choose the feature and cutpoint whose split gives the lowest mean squared error (MSE). Make that split.
3. Examine the resulting regions, and again split one of them, choosing the feature and cutpoint that minimize the MSE.
4. Keep repeating step 3 until a stopping criterion is fulfilled, such as:
   - maximum tree depth
   - minimum number of data points in a leaf
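As a rough sketch of step 2 only (a greedy search for the single best feature/cutpoint pair; the function `best_split` is mine, and it assumes the wine `train` DataFrame loaded above):

```python
import numpy as np

def best_split(df, features, target='quality'):
    """Greedy search over every feature and cutpoint for the single
    split that gives the lowest MSE (step 2 above)."""
    best = (None, None, np.inf)          # (feature, cutpoint, mse)
    y = df[target]
    for feature in features:
        # candidate cutpoints: midpoints between consecutive unique values
        values = np.sort(df[feature].unique())
        cutpoints = (values[:-1] + values[1:]) / 2
        for cut in cutpoints:
            left = y[df[feature] < cut]
            right = y[df[feature] >= cut]
            # MSE when each side is predicted with its own mean
            mse = (((left - left.mean()) ** 2).sum() +
                   ((right - right.mean()) ** 2).sum()) / len(y)
            if mse < best[2]:
                best = (feature, cut, mse)
    return best

# search every wine feature except the target and our helper column
candidates = [c for c in train.columns if c not in ('quality', 'prediction')]
print(best_split(train, candidates))
```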
In [11]:
X = train.iloc[:,0:11]
feature_cols = train.iloc[:,0:11].columns
y = train.quality
In [12]:
from sklearn.tree import DecisionTreeRegressor
treereg = DecisionTreeRegressor(splitter='best', random_state=1)
treereg
Out[12]:
In [13]:
# use leave-one-out cross-validation (LOOCV)
from sklearn.model_selection import cross_val_score
# cv=len(X) makes this leave-one-out cross-validation (LOOCV)
scores = cross_val_score(treereg, X, y, cv=len(X), scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))
Out[13]:
In [14]:
# list of values to try
max_depth_range = range(1, 8)
# list to store the RMSE for each value of max_depth
RMSE_scores = []
# use 14-fold cross-validation for each value of max_depth
for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    MSE_scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))
In [15]:
# plot max_depth (x-axis) versus RMSE (y-axis)
plt.plot(max_depth_range, RMSE_scores)
plt.xlabel('max_depth')
plt.ylabel('RMSE (lower is better)')
Out[15]:
In [16]:
# max_depth=3 was best, so fit a tree using that parameter
treereg = DecisionTreeRegressor(max_depth=3, random_state=1)
treereg.fit(X, y)
Out[16]:
In [17]:
pd.DataFrame({'feature':feature_cols, 'importance':treereg.feature_importances_})
Out[17]:
In [18]:
# create a Graphviz file
from sklearn.tree import export_graphviz
export_graphviz(treereg, out_file='images/tree_wine.dot', feature_names=feature_cols)
# At the command line, run this to convert to PNG:
# dot -Tpng tree_wine.dot -o tree_wine.png
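If Graphviz is not installed, newer versions of scikit-learn (0.21+) can also print the fitted tree as plain text with `export_text`; a quick sketch:

```python
from sklearn.tree import export_text
print(export_text(treereg, feature_names=list(feature_cols)))
```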
Reading the internal nodes
- samples: number of observations in the node before splitting
- mse: MSE calculated by comparing the response values in the node against the mean response of the node
- rule: rule used to split the node (go to the left if it is true and to the right if it is not)
Reading the leaves
- samples: number of observations in the node
- value: mean response in the node
- mse: MSE calculated by comparing the response values in the node against the "value" shown in the node
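As a quick sanity check (my own snippet, not part of the Graphviz output): the root node's value should be the overall mean of the response, and its mse should be the mean squared deviation from that mean:

```python
import numpy as np
print(y.mean())                      # should match the root node's "value"
print(np.mean((y - y.mean()) ** 2))  # should match the root node's "mse"
```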
2. Decision Tree Classifier
| Regression trees | Classification trees |
|---|---|
| predict a continuous response | predict a categorical response |
| predict using the mean response of the leaf | predict using the most common class in the leaf |
| the partitions are chosen so that they minimize the MSE | the partitions are chosen so that they minimize the "Gini index" |
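A minimal sketch of fitting a classification tree on the same wine data (assuming we treat `quality` as a class label; `max_depth=3` is only an illustrative choice):

```python
from sklearn.tree import DecisionTreeClassifier

treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(X, y)

# predicted class (most common class in the leaf) for the first few wines
print(treeclf.predict(X.head()))
# class probabilities come from the class proportions in the leaf
print(treeclf.predict_proba(X.head()))
```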
Partitioning criteria
Some common partitioning criteria:
- Classification error rate: fraction of training data points in the region that do not belong to the most common class.
- Gini index: total variance across the classes in a region.
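As a worked sketch of both criteria (helper functions of my own, using the usual definitions: error rate $= 1 - \max_k \hat{p}_k$ and Gini index $= \sum_k \hat{p}_k (1 - \hat{p}_k)$):

```python
import numpy as np

def class_proportions(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def error_rate(labels):
    # fraction of points not in the most common class
    return 1 - class_proportions(labels).max()

def gini(labels):
    p = class_proportions(labels)
    return np.sum(p * (1 - p))

region = ['red', 'red', 'red', 'blue']   # a toy region with 4 points
print(error_rate(region))  # 0.25
print(gini(region))        # 0.75*0.25 + 0.25*0.75 = 0.375
```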