Here I review decision trees because they are super common in competitions.
A decision tree can be used for both regression and classification, which is why these trees are known as Classification And Regression Trees (CART).
A model to predict salary follows these steps:
- Divide the feature space into regions where data points have something in common.
- Similarity within each region is maximized, and similarity between different regions is minimized.
- To predict, find the region that matches a player's features and return the mean salary of that region.
- Only straight lines are used to divide regions.
- These lines have to be vertical or horizontal.
- The dividing lines cannot cross each other.
Then, a solution might look like this. We define two cutpoints, 4.5 on Years and 117.5 on Hits, so the space is divided into 3 regions:
- $R_1$:
- $\text{Years} < 4.5$
- Mean salary = $165,174
- $R_2$:
- $\text{Years} \geq 4.5$
- $\text{Hits} < 117.5$
- Mean salary = $402,834
- $R_3$:
- $\text{Years} \geq 4.5$
- $\text{Hits} \geq 117.5$
- Mean salary = $845,346
All that forms a tree that's known as a regression tree.
The tree shows $\log(165174/1000) \approx 5.11$ at the first leaf (the mean salary in thousands of dollars, log-transformed), and similarly for the rest of the leaves. Each leaf corresponds to a region.
Then, we can just follow the branches to make predictions.
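As a minimal sketch of how these predictions work (the cutpoints and mean salaries are the ones listed above; the function name is just illustrative):

```python
import numpy as np

def predict_salary(years, hits):
    """Follow the branches of the hand-built regression tree above."""
    if years < 4.5:
        return 165174   # R1: mean salary for players with few years
    if hits < 117.5:
        return 402834   # R2: experienced players with few hits
    return 845346       # R3: experienced players with many hits

print(predict_salary(3, 100))                 # 165174 (region R1)
# the plotted leaves show log(salary / 1000):
print(np.log(predict_salary(3, 100) / 1000))  # ≈ 5.11
```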
What does this say about the data?
- Years are the most important factor when determining salary.
- For a novice player, hits aren't important.
The training error keeps decreasing as the tree gets bigger, while the test error eventually starts to increase: this is overfitting. The best depth would be 3, where the training, cross-validation, and test errors are close to each other.
In [1]:
import pandas as pd
url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
train = pd.read_csv(url, sep=';')
train.head()
Out[1]:
In [2]:
train.plot(kind='scatter', x='alcohol', y='sulphates', c='quality', colormap='jet')
Out[2]:
In [3]:
train.describe()
Out[3]:
In [4]:
train['prediction'] = train.quality.mean()
train['prediction'].values
Out[4]:
In [5]:
from sklearn import metrics
import numpy as np
np.sqrt(metrics.mean_squared_error(train.quality, train.prediction))
Out[5]:
In [6]:
def alcohol_split(alcohol):
    """Split on the given alcohol cutpoint, predict the mean quality
    of each side, and return the resulting RMSE."""
    lower_quality = train[train.alcohol < alcohol].quality.mean()
    higher_quality = train[train.alcohol >= alcohol].quality.mean()
    train['prediction'] = np.where(train.alcohol < alcohol, lower_quality, higher_quality)
    return np.sqrt(metrics.mean_squared_error(train.quality, train.prediction))
In [7]:
print('RMSE:', alcohol_split(10))
In [8]:
alcohol_range = np.linspace(train.alcohol.min(), train.alcohol.max(), 2000)
RMSE = [alcohol_split(alcohol) for alcohol in alcohol_range]
In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['font.size'] = 14
In [10]:
plt.plot(alcohol_range, RMSE)
plt.xlabel('Alcohol cutpoint')
plt.ylabel('RMSE (lower is better)')
Out[10]:
How does a computer build a regression tree?
Ideal approach: consider all possible partitions of the feature space (computationally infeasible).

Pretty good approach: recursive binary partitioning (a greedy sketch of a single split is shown after this list).

1. Start at the root of the tree.
2. For each feature, examine all possible cutpoints, and choose the feature and cutpoint whose split gives the lowest mean squared error (MSE). Make that split.
3. Examine the resulting regions, and again split one of them, choosing the feature and cutpoint that minimize the MSE.
4. Keep repeating step 3 until a stopping criterion is fulfilled, such as:
   - maximum tree depth
   - minimum number of data points in a leaf
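As a rough sketch of step 2 only (a greedy search for the single best feature/cutpoint pair; the function `best_split` is mine, and it assumes the wine `train` DataFrame loaded above):

```python
import numpy as np

def best_split(df, features, target='quality'):
    """Greedy search over every feature and cutpoint for the single
    split that gives the lowest MSE (step 2 above)."""
    best = (None, None, np.inf)          # (feature, cutpoint, mse)
    y = df[target]
    for feature in features:
        # candidate cutpoints: midpoints between consecutive unique values
        values = np.sort(df[feature].unique())
        cutpoints = (values[:-1] + values[1:]) / 2
        for cut in cutpoints:
            left = y[df[feature] < cut]
            right = y[df[feature] >= cut]
            # MSE when each side is predicted with its own mean
            mse = (((left - left.mean()) ** 2).sum() +
                   ((right - right.mean()) ** 2).sum()) / len(y)
            if mse < best[2]:
                best = (feature, cut, mse)
    return best

# search every wine feature except the target and our helper column
candidates = [c for c in train.columns if c not in ('quality', 'prediction')]
print(best_split(train, candidates))
```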
In [11]:
X = train.iloc[:,0:11]
feature_cols = train.iloc[:,0:11].columns
y = train.quality
In [12]:
from sklearn.tree import DecisionTreeRegressor
treereg = DecisionTreeRegressor(splitter='best', random_state=1)
treereg
Out[12]:
In [13]:
# use leave-one-out cross-validation (LOOCV)
from sklearn.model_selection import cross_val_score
# cv=len(X) makes this leave-one-out cross-validation (LOOCV)
scores = cross_val_score(treereg, X, y, cv=len(X), scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))
Out[13]:
In [14]:
# list of values to try
max_depth_range = range(1, 8)
# list to store the RMSE for each value of max_depth
RMSE_scores = []
# use 14-fold cross-validation for each value of max_depth
for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    MSE_scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))
In [15]:
# plot max_depth (x-axis) versus RMSE (y-axis)
plt.plot(max_depth_range, RMSE_scores)
plt.xlabel('max_depth')
plt.ylabel('RMSE (lower is better)')
Out[15]:
In [16]:
# max_depth=3 was best, so fit a tree using that parameter
treereg = DecisionTreeRegressor(max_depth=3, random_state=1)
treereg.fit(X, y)
Out[16]:
In [17]:
pd.DataFrame({'feature':feature_cols, 'importance':treereg.feature_importances_})
Out[17]:
In [18]:
# create a Graphviz file
from sklearn.tree import export_graphviz
export_graphviz(treereg, out_file='images/tree_wine.dot', feature_names=feature_cols)
# At the command line, run this to convert to PNG:
# dot -Tpng tree_wine.dot -o tree_wine.png
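If Graphviz is not installed, newer versions of scikit-learn (0.21+) can also print the fitted tree as plain text with `export_text`; a quick sketch:

```python
from sklearn.tree import export_text
print(export_text(treereg, feature_names=list(feature_cols)))
```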
Reading the internal nodes
- samples: number of observations in the node before splitting
- mse: MSE calculated by comparing the response values in the node against the mean response of the node
- rule: rule used to split the node (go to the left if it is true and to the right if it is not)
Reading the leaves
- samples: number of observations in the node
- value: mean response in the node
- mse: MSE calculated by comparing the response values in the node against the "value" shown in the node
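As a quick sanity check (my own snippet, not part of the Graphviz output): the root node's value should be the overall mean of the response, and its mse should be the mean squared deviation from that mean:

```python
import numpy as np
print(y.mean())                      # should match the root node's "value"
print(np.mean((y - y.mean()) ** 2))  # should match the root node's "mse"
```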
2. Decision Tree Classifier
| Regression trees | Classification trees |
|---|---|
| predict a continuous response | predict a categorical response |
| predict using the mean response of the leaf | predict using the most common class in the leaf |
| the partitions are chosen so that they minimize the MSE | the partitions are chosen so that they minimize the "Gini index" |
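A minimal sketch of fitting a classification tree on the same wine data (assuming we treat `quality` as a class label; `max_depth=3` is only an illustrative choice):

```python
from sklearn.tree import DecisionTreeClassifier

treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(X, y)

# predicted class (most common class in the leaf) for the first few wines
print(treeclf.predict(X.head()))
# class probabilities come from the class proportions in the leaf
print(treeclf.predict_proba(X.head()))
```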
Partitioning criteria
Some common partitioning criteria:
- Classification error rate: fraction of training data points in the region that do not belong to the most common class.
- Gini index: total variance across the classes in a region.
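As a worked sketch of both criteria (helper functions of my own, using the usual definitions: error rate $= 1 - \max_k \hat{p}_k$ and Gini index $= \sum_k \hat{p}_k (1 - \hat{p}_k)$):

```python
import numpy as np

def class_proportions(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def error_rate(labels):
    # fraction of points not in the most common class
    return 1 - class_proportions(labels).max()

def gini(labels):
    p = class_proportions(labels)
    return np.sum(p * (1 - p))

region = ['red', 'red', 'red', 'blue']   # a toy region with 4 points
print(error_rate(region))  # 0.25
print(gini(region))        # 0.75*0.25 + 0.25*0.75 = 0.375
```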