Tutorial: Tree-based and ensemble models in Python (ANSWERS)

This is not a standalone tutorial, as most of the theory and important questions are covered in the main tutorial in R. This second tutorial serves the purpose of helping you build your own tree-based and ensemble models in Python. I expect that you have already covered the video lectures and the main tutorial in R by now, so I assume that you already know what you are doing; you just don’t know how to code it in Python yet. Whereas the R tutorial provides more assistance in digesting the material, this tutorial in Python pushes you to be more independent, with minimal hints. (I used these notebooks to prepare this tutorial: 1, 2, 3)

We keep working with the same COMPAS dataset, and we will use the sklearn library.

# Import the usual libraries
import sys
import sklearn
import numpy as np
import pandas as pd
import os

Load data and split it into train and test sets

Let’s load the sample of COMPAS data we used in the previous tutorial. For your convenience, I already transformed all factor variables into dummy variables using the model.matrix() function in R and saved the result on GitHub as a csv file, so you can access the data directly via the link below.

data_location = "https://raw.githubusercontent.com/madina-k/DSE2021_tutorials/main/tutorial_trees/data/compas_sample500matrix.csv"
data = pd.read_csv(data_location)
data.head()
Two_yr_Recidivismyes Number_of_Priors Age_Above_FourtyFiveyes Age_Below_TwentyFiveyes FemaleMale Misdemeanoryes ethnicityCaucasian
0 1 -0.012530 0 1 0 1 1
1 1 -0.400115 0 1 1 1 0
2 1 0.701857 1 0 1 1 0
3 0 -0.930922 0 0 1 0 1
4 1 -0.545324 0 1 1 1 0
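The dummy encoding itself could also have been done directly in pandas rather than in R. Below is a minimal sketch on a made-up toy frame; the column names and values are illustrative only, not the actual raw COMPAS columns.

# Sketch: encoding factor columns as dummies in pandas, similar to model.matrix() in R.
# The column names and values here are made up for illustration only.
toy = pd.DataFrame({'Misdemeanor': ['yes', 'no', 'yes'],
                    'ethnicity': ['Caucasian', 'Other', 'Caucasian']})
print(pd.get_dummies(toy, drop_first=True))  # one dummy per non-reference level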

Use the train_test_split function from sklearn.model_selection to split the data into four objects: X_train, X_test, y_train, y_test

from sklearn.model_selection import train_test_split


y = data.pop('Two_yr_Recidivismyes').values
X = data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
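By default train_test_split puts 25% of the rows into the test set. If you want to control the split size or keep the class balance the same in both sets, you could pass test_size and stratify explicitly, as in the sketch below (not what the answer above uses).

# Sketch: explicit test share and stratified sampling on the outcome variable.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)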

Decision Tree

Train a decision tree using DecisionTreeClassifier class from sklearn.tree. Look at the documentation here and fit a decision tree using Gini node impurity measure, with minimum 25 observations in each leaf, and the maximum tree depth of 3.

from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(min_samples_leaf=25, max_depth=3)
tree_clf.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=25, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

Now visualize the resulting tree using plot_tree from sklearn.tree. See the documentation here. Don’t forget to pass the feature names and the class names, and set the font size to 14.

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(18,18))  # set plot size (denoted in inches)
plot_tree(tree_clf, feature_names=X.columns, class_names=['No', 'Yes'],fontsize=14 )
plt.show()
_images/ensemble-answers_9_0.png
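If the rendered figure is hard to read, the same tree can also be printed as plain text with export_text from sklearn.tree (a small sketch):

from sklearn.tree import export_text

# Text rendering of the fitted tree; handy when the plotted figure is too dense.
print(export_text(tree_clf, feature_names=list(X.columns)))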

This is a rather bushy tree. Older versions of sklearn offer no built-in way to prune it after fitting, although newer versions (0.22+) support cost-complexity pruning through the ccp_alpha parameter. If you really want to prune this tree by hand, look at how it can be done in this blog.
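For reference, here is a minimal sketch of cost-complexity pruning via ccp_alpha. It assumes scikit-learn 0.22 or newer, and the choice of alpha below is arbitrary; in practice you would pick it by cross-validation.

# Sketch: cost-complexity pruning, available in scikit-learn >= 0.22.
path = tree_clf.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary mid-range alpha; tune by CV in practice
pruned_clf = DecisionTreeClassifier(min_samples_leaf=25, ccp_alpha=alpha)
pruned_clf.fit(X_train, y_train)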

Finally, let’s check its out-of-sample accuracy with the accuracy_score function from sklearn.metrics. Read the documentation here.

from sklearn.metrics import accuracy_score

y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))
0.728

Bagging

Now it’s time for some bagging! Use BaggingClassifier from sklearn.ensemble. See the documentation here. Create a new tree object that is allowed to grow to a maximum depth of 2 and put that classifier inside the BaggingClassifier. Bag 500 trees, each fit on a bootstrap sample of at most 100 observations. Print the accuracy score.

from sklearn.ensemble import BaggingClassifier

tree_clf_forbagging = DecisionTreeClassifier(max_depth=2)

bag_clf = BaggingClassifier(
    tree_clf_forbagging, n_estimators=500,
    max_samples=100, bootstrap=True)
bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_bag))
0.72
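Because bagging draws bootstrap samples, each tree leaves some observations out, and those out-of-bag observations give a free validation estimate. A minimal sketch, reusing the same objects as above:

# Sketch: out-of-bag (OOB) accuracy estimate from the bagging ensemble.
bag_clf_oob = BaggingClassifier(
    DecisionTreeClassifier(max_depth=2), n_estimators=500,
    max_samples=100, bootstrap=True, oob_score=True)
bag_clf_oob.fit(X_train, y_train)
print(bag_clf_oob.oob_score_)  # mean accuracy on out-of-bag observations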

Random Forest

Now, let’s use RandomForestClassifier from sklearn.ensemble. See documentation here. Create a random forest of 500 trees, where each tree has a maximum depth of 2. Print the accuracy score for the test sample.

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=500, max_depth=2)
rf_clf.fit(X_train, y_train)

y_pred_rf = rf_clf.predict(X_test)

print(accuracy_score(y_test, y_pred_rf))
0.696

Check the importance of each feature using the feature_importances_ property of the classifier. (For example, see this page.)

for name, score in zip(list(data.columns), rf_clf.feature_importances_):
    print(name,score)
Number_of_Priors 0.5951675288832771
Age_Above_FourtyFiveyes 0.15327959016558776
Age_Below_TwentyFiveyes 0.04963856624481278
FemaleMale 0.06136793924146804
Misdemeanoryes 0.07341448602922451
ethnicityCaucasian 0.06713188943562985
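To make the ranking easier to read, the importances could also be collected into a sorted pandas Series (a small sketch):

# Sketch: feature importances as a sorted pandas Series.
importances = pd.Series(rf_clf.feature_importances_, index=data.columns)
print(importances.sort_values(ascending=False))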

Boosting with Gradient Boosting

There are several different ways to boost trees (e.g., the XGBoost library or AdaBoostClassifier from sklearn). In this tutorial, I ask you to train a GradientBoostingClassifier from sklearn.ensemble. See the documentation here. Make 300 trees with a maximum depth of 2. Set the learning rate to 0.01.

Then use the staged_predict method to predict the target variable at each stage of boosting (1 tree only, 2 trees together, 3 trees, and so on) using the test X. Get the mean squared error using the mean_squared_error function from sklearn.metrics. See the documentation here. And find the number of trees that minimizes the MSE on the test sample.

from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingClassifier

gbc_clf = GradientBoostingClassifier(max_depth=2, n_estimators=300, learning_rate=0.01)
gbc_clf.fit(X_train, y_train)

errors = [mean_squared_error(y_test, y_pred)
          for y_pred in gbc_clf.staged_predict(X_test)]

min_error = np.min(errors)
print(min_error)


optimal_n_trees = np.argmin(errors) + 1
print(optimal_n_trees)
0.304
26
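To see how the test error evolves as trees are added, you could also plot the staged errors (a sketch reusing the errors list, optimal_n_trees, and the matplotlib import from earlier):

# Sketch: test MSE as a function of the number of boosting stages.
plt.figure(figsize=(8, 4))
plt.plot(range(1, len(errors) + 1), errors)
plt.axvline(optimal_n_trees, linestyle='--')  # mark the error-minimizing number of trees
plt.xlabel('Number of trees')
plt.ylabel('Test MSE')
plt.show()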

Train the Gradient boosting classifier model with the optimal number of trees and get the accuracy score.

gbc_best = GradientBoostingClassifier(max_depth=2, n_estimators=optimal_n_trees, learning_rate=0.01)
gbc_best.fit(X_train, y_train)



y_pred_gbcbest = gbc_best.predict(X_test)
print(accuracy_score(y_test, y_pred_gbcbest))
0.696
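For comparison, the AdaBoostClassifier mentioned above can be fit in much the same way. A sketch with illustrative, untuned parameters:

from sklearn.ensemble import AdaBoostClassifier

# Sketch: AdaBoost over depth-2 trees; parameters are illustrative, not tuned.
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                             n_estimators=300, learning_rate=0.01)
ada_clf.fit(X_train, y_train)
print(accuracy_score(y_test, ada_clf.predict(X_test)))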

Voting classifier

Finally, we will combine all the other models into one ensemble model.

Import VotingClassifier from sklearn.ensemble.

We will also import a Logistic Regression and a Support Vector Machine classifier.

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver='lbfgs')
svm_clf = SVC(probability=True, gamma='scale')

Now put all the different classifiers to be combined inside the voting classifier and compare accuracy scores across classifiers:

voting_clf = VotingClassifier(
    estimators = [('lr', log_clf), # Logistic classifier
                  ('rf', rf_clf),  # Random forest classifier
                  ('svm', svm_clf), # SVM classifier
                  ('bag', bag_clf), # Bagging classifier
                  ('gbrt', gbc_best)],  # Boosting classifier
    voting ='hard')

for clf in (tree_clf,log_clf, rf_clf, svm_clf, bag_clf, gbc_best, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
DecisionTreeClassifier 0.624
LogisticRegression 0.656
RandomForestClassifier 0.696
SVC 0.688
BaggingClassifier 0.728
GradientBoostingClassifier 0.696
VotingClassifier 0.72
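Since every classifier in this ensemble can output class probabilities (note that SVC was created with probability=True), you could also try soft voting, which averages the predicted probabilities instead of counting votes. A minimal sketch:

# Sketch: soft voting over the same estimators.
voting_clf_soft = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rf_clf), ('svm', svm_clf),
                ('bag', bag_clf), ('gbrt', gbc_best)],
    voting='soft')
voting_clf_soft.fit(X_train, y_train)
print(accuracy_score(y_test, voting_clf_soft.predict(X_test)))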