Lab: Decision Trees for Classification and Regression

A guided practical session with questions

Alex Sánchez
Python version

This notebook is the Python equivalent of the reduced R lab on decision trees for classification and regression. It follows the same structure and keeps the same pedagogical aim: not only to fit trees, but to interpret them, assess prediction error, and understand pruning.

Learning goals

This lab is a hands-on exercise on classification and regression trees: how they are built, how they are interpreted, and how they are used for prediction.

The main goal is not only to fit a decision tree, but to understand:

how a tree partitions the predictor space;
how a fitted tree produces class predictions and class probabilities;
why training error is not enough to assess prediction performance;
how pruning controls model complexity;
how the same ideas change when the response is quantitative rather than categorical.

The lab is intentionally written so that most questions are transferable to any similar dataset. The default dataset used here is the Pima Indians diabetes dataset, but the structure can be reused for another classification dataset with minimal changes.

Note on the questions. Questions are numbered automatically using Markdown ordered lists. Some optional questions are kept in the source file as HTML comments. To recover one of them, simply remove the comment delimiters; the list will be renumbered automatically when the notebook is rendered.

Packages

We use pandas for data manipulation, scikit-learn for decision trees and model assessment, and matplotlib for plotting.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, StratifiedKFold, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree, export_text
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, mean_squared_error

pd.set_option("display.max_columns", None)

Part 1. Classification tree

1.1 Data and prediction problem

In this example we use the PimaIndiansDiabetes2 dataset from the R package mlbench.

The cell below first tries to read a local file called PimaIndiansDiabetes2.csv. If it is not found, it downloads the dataset from the Rdatasets repository. If you use a different dataset, replace this cell with the corresponding data import code.

from pathlib import Path

local_file = Path("PimaIndiansDiabetes2.csv")
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/mlbench/PimaIndiansDiabetes2.csv"

if local_file.exists():
    mydataset = pd.read_csv(local_file)
else:
    mydataset = pd.read_csv(url)

# The Rdatasets version contains a first index column called 'rownames'.
if "rownames" in mydataset.columns:
    mydataset = mydataset.drop(columns="rownames")

mydataset.head()

The response variable is diabetes, a binary variable indicating whether each individual is classified as diabetes-positive or diabetes-negative.

mydataset.info()

mydataset.describe(include="all")

mydataset.isna().sum()

Questions

What is the response variable? Is this a classification or a regression problem?
Identify at least three predictors. For each one, indicate whether it is quantitative or categorical.
Before fitting a tree, inspect the data. Are there missing values? If so, in which variables? Why could missing values matter for model fitting and model assessment?

1.2 Minimal preprocessing

Because this is a classroom lab, we remove all rows with missing values and use only complete cases. This is not necessarily the best strategy in a real analysis, but it keeps the focus on tree construction and model assessment.

mydataset_cc = mydataset.dropna().copy()

print("Original dimensions:", mydataset.shape)
print("Complete-case dimensions:", mydataset_cc.shape)

print("\nClass distribution in the original dataset:")
print(mydataset["diabetes"].value_counts())

print("\nClass distribution after complete-case filtering:")
print(mydataset_cc["diabetes"].value_counts())

Questions

What is the consequence of using dropna()? In a real study, what alternative preprocessing strategy could be considered?

1.3 Train/test split

We split the dataset into training and test sets. The model will be fitted using the training set and evaluated using the test set.

In this Python version we use a stratified split, so that the class proportions are preserved as much as possible in the training and test sets.

random_state = 123
prop_train = 0.70

X = mydataset_cc.drop(columns="diabetes")
y = mydataset_cc["diabetes"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size=prop_train,
    random_state=random_state,
    stratify=y
)

train_data = X_train.copy()
train_data["diabetes"] = y_train

test_data = X_test.copy()
test_data["diabetes"] = y_test

print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])

print("\nTraining class distribution:")
print(y_train.value_counts())

print("\nTest class distribution:")
print(y_test.value_counts())
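To see what stratification does in isolation, the following minimal sketch splits a small artificial dataset with and without the stratify argument. The data and the names X_toy, y_toy, prop_pos_strat and prop_pos_plain are hypothetical and are not part of the lab; only train_test_split itself comes from the code above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 "neg" and 20 "pos" (20% positives).
y_toy = np.array(["neg"] * 80 + ["pos"] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# Stratified split: class proportions are matched in both parts.
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, train_size=0.70, random_state=0, stratify=y_toy
)

# Plain random split: class proportions may drift by chance.
_, _, _, y_test_plain = train_test_split(
    X_toy, y_toy, train_size=0.70, random_state=0
)

prop_pos_strat = np.mean(y_test_strat == "pos")
prop_pos_plain = np.mean(y_test_plain == "pos")

print("Proportion of pos in test set (stratified):", prop_pos_strat)
print("Proportion of pos in test set (plain):     ", prop_pos_plain)
```

With the stratified split, the proportion of positives in the test set stays at the overall 20% (up to rounding), whereas the plain split gives whatever the random draw produces.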

Questions

Why do we split the data into training and test sets instead of evaluating the tree on the same data used to fit it?
Check whether the class distribution is similar in the training and test sets. Why could a very unbalanced split be problematic?

Part 2. Fitting and interpreting a classification tree

2.1 Fit an initial tree

We fit a classification tree using DecisionTreeClassifier() from scikit-learn.

The initial tree is intentionally allowed to grow relatively freely. Later we will prune it using cost-complexity pruning.

tree_class = DecisionTreeClassifier(
    criterion="gini",
    random_state=random_state,
    ccp_alpha=0.0
)

tree_class.fit(X_train, y_train)

print(export_text(tree_class, feature_names=list(X_train.columns), max_depth=4))

plt.figure(figsize=(18, 9))
plot_tree(
    tree_class,
    feature_names=list(X_train.columns),
    class_names=list(tree_class.classes_),
    filled=True,
    rounded=True,
    proportion=True,
    max_depth=3,
    fontsize=9
)
plt.title("Initial classification tree, truncated to depth 3")
plt.show()

The plot displays the splitting variable and threshold, the impurity measure, the class proportions and the predicted class in each node. For readability, the displayed tree is truncated to depth 3. The fitted tree itself may be larger.
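Besides class labels, a fitted tree can also return class probabilities. As a minimal sketch on synthetic data (the names X_toy, y_toy and toy_tree are hypothetical, and make_classification is used only to keep the example self-contained), the code below checks that predict_proba returns the class proportions of the training observations falling in the same leaf, and that predict returns the class with the largest proportion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the Pima dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)

# A shallow tree, so that each leaf contains many observations.
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

proba_toy = toy_tree.predict_proba(X_toy[:5])  # one row per observation, one column per class
pred_toy = toy_tree.predict(X_toy[:5])

# apply() returns the index of the leaf in which each observation ends up.
leaf_ids = toy_tree.apply(X_toy)

# Class proportions among training observations sharing the leaf of observation 0.
same_leaf = leaf_ids == leaf_ids[0]
leaf_props = np.bincount(y_toy[same_leaf], minlength=2) / same_leaf.sum()

print("predict_proba for observation 0:", proba_toy[0])
print("Class proportions in its leaf:  ", leaf_props)
print("Predicted class:                ", pred_toy[0])
```

The same calls, predict_proba and apply, are available on tree_class once it has been fitted.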

Questions

Which predictor appears in the root node? What does this suggest about its predictive role in this fitted tree?
Choose one internal split. Explain it as a decision rule of the form: “if predictor X is below/above a threshold, then observations go to…”.

Explain, in your own words, how the tree transforms a vector of predictors \(x\) into a class prediction \(\hat{y}\).

Part 3. Training error, test error and confusion matrix

3.1 Predictions on train and test data

pred_train_class = tree_class.predict(X_train)
pred_test_class = tree_class.predict(X_test)

labels = ["neg", "pos"]

conf_train = pd.DataFrame(
    confusion_matrix(y_train, pred_train_class, labels=labels),
    index=[f"Observed {x}" for x in labels],
    columns=[f"Predicted {x}" for x in labels]
)

conf_test = pd.DataFrame(
    confusion_matrix(y_test, pred_test_class, labels=labels),
    index=[f"Observed {x}" for x in labels],
    columns=[f"Predicted {x}" for x in labels]
)

print("Training confusion matrix")
display(conf_train)

print("Test confusion matrix")
display(conf_test)

train_error = 1 - accuracy_score(y_train, pred_train_class)
test_error = 1 - accuracy_score(y_test, pred_test_class)

print("Training error:", train_error)
print("Test error:", test_error)

3.2 Accuracy, sensitivity and specificity

For a binary classification problem, accuracy is often not enough. We also compute sensitivity and specificity. In this dataset, the positive class is coded as pos.

def classification_metrics(observed, predicted, positive="pos"):
    labels = list(pd.Series(observed).astype(str).unique())
    if positive not in labels:
        raise ValueError("The specified positive class is not present in the observed data.")
    negative = [x for x in sorted(labels) if x != positive][0]
    
    cm = confusion_matrix(observed, predicted, labels=[negative, positive])
    TN, FP, FN, TP = cm.ravel()
    
    accuracy = (TP + TN) / cm.sum()
    sensitivity = TP / (TP + FN) if (TP + FN) > 0 else np.nan
    specificity = TN / (TN + FP) if (TN + FP) > 0 else np.nan
    # Misclassification rate on whatever data was passed in (training or test).
    error = 1 - accuracy
    
    return pd.DataFrame({
        "accuracy": [accuracy],
        "sensitivity": [sensitivity],
        "specificity": [specificity],
        "error": [error]
    })

classification_metrics(y_train, pred_train_class, positive="pos")

classification_metrics(y_test, pred_test_class, positive="pos")
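As a cross-check on hand-computed metrics, sensitivity and specificity can also be obtained with recall_score, which is imported at the top of the notebook: sensitivity is the recall of the positive class and specificity is the recall of the negative class. The sketch below uses a small made-up set of labels (obs_toy and pred_toy are hypothetical) so that the counts can be verified by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical observed and predicted labels: 4 "pos" cases followed by 6 "neg" cases.
obs_toy = np.array(["pos"] * 4 + ["neg"] * 6)
pred_toy = np.array(["pos", "pos", "pos", "neg",                 # 3 TP, 1 FN
                     "neg", "neg", "neg", "neg", "pos", "pos"])  # 4 TN, 2 FP

# With labels=["neg", "pos"], ravel() returns TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(obs_toy, pred_toy, labels=["neg", "pos"]).ravel()

sens_manual = tp / (tp + fn)  # 3 / 4
spec_manual = tn / (tn + fp)  # 4 / 6

# Same quantities as recalls of the positive and the negative class.
sens_sklearn = recall_score(obs_toy, pred_toy, pos_label="pos")
spec_sklearn = recall_score(obs_toy, pred_toy, pos_label="neg")

print("Sensitivity:", sens_manual, sens_sklearn)
print("Specificity:", spec_manual, spec_sklearn)
```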

Questions

Compare the training error and the test error. Which one is smaller? Is this expected?
Why is the training error usually an optimistic estimate of prediction error?

In a biomedical classification problem, why might sensitivity and specificity be more informative than accuracy alone?

Part 4. Cost-complexity pruning

A large tree may fit the training data too closely. Cost-complexity pruning controls tree complexity by balancing goodness of fit and tree size.

In scikit-learn, the cost-complexity parameter is called ccp_alpha. Larger values of ccp_alpha lead to smaller trees.
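The criterion minimized by cost-complexity pruning, as described in the scikit-learn documentation, is \(R_\alpha(T) = R(T) + \alpha |\tilde{T}|\), where \(R(T)\) is the total impurity of the terminal nodes of the tree \(T\), \(|\tilde{T}|\) is its number of terminal nodes and \(\alpha\) is ccp_alpha. The following minimal sketch, on synthetic data and with arbitrarily chosen values of alpha (all names are hypothetical), illustrates that increasing ccp_alpha never increases the number of leaves.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the Pima dataset).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Arbitrary illustrative values; in the lab the candidates come from the pruning path.
alphas_toy = [0.0, 0.005, 0.02, 0.1]

leaves_toy = []
for alpha_toy in alphas_toy:
    toy_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha_toy)
    toy_tree.fit(X_toy, y_toy)
    leaves_toy.append(toy_tree.get_n_leaves())
    print(f"ccp_alpha = {alpha_toy:<6} -> {toy_tree.get_n_leaves()} leaves")
```

Because the same full tree is grown each time and then pruned, the trees obtained for increasing values of alpha are nested subtrees of one another.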

4.1 Cross-validation table

We first obtain the sequence of candidate ccp_alpha values from the fitted tree. Then we evaluate each candidate by cross-validation on the training set.

path = tree_class.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# The largest alpha usually corresponds to the root-only tree. We remove it for plotting and comparison.
ccp_alphas = ccp_alphas[:-1]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)

rows = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=random_state, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    train_err = 1 - clf.score(X_train, y_train)
    cv_err = 1 - cross_val_score(clf, X_train, y_train, cv=cv, scoring="accuracy").mean()
    rows.append({
        "ccp_alpha": alpha,
        "n_leaves": clf.get_n_leaves(),
        "depth": clf.get_depth(),
        "train_error": train_err,
        "cv_error": cv_err
    })

cp_table = pd.DataFrame(rows)
cp_table.head(15)

plt.figure(figsize=(7, 5))
plt.plot(cp_table["ccp_alpha"], cp_table["train_error"], marker="o", label="Training error")
plt.plot(cp_table["ccp_alpha"], cp_table["cv_error"], marker="o", label="Cross-validated error")
plt.xlabel("ccp_alpha")
plt.ylabel("Error")
plt.title("Cost-complexity pruning path")
plt.legend()
plt.show()

The table contains the following columns:

ccp_alpha: cost-complexity pruning parameter;
n_leaves: number of terminal nodes;
depth: depth of the tree;
train_error: training error;
cv_error: cross-validated error.

Questions

What happens to the cross-validated error as the tree becomes more complex? Does it always decrease?
Why is cross-validated error more relevant than training error for choosing the size of the tree?

4.2 Select a pruned tree

We first select the tree with minimum cross-validated error.

best_row = cp_table["cv_error"].idxmin()
best_alpha = cp_table.loc[best_row, "ccp_alpha"]
best_alpha

pruned_class = DecisionTreeClassifier(
    criterion="gini",
    random_state=random_state,
    ccp_alpha=best_alpha
)
pruned_class.fit(X_train, y_train)

plt.figure(figsize=(14, 7))
plot_tree(
    pruned_class,
    feature_names=list(X_train.columns),
    class_names=list(pruned_class.classes_),
    filled=True,
    rounded=True,
    proportion=True,
    fontsize=9
)
plt.title("Pruned classification tree")
plt.show()
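Minimum cross-validated error is not the only possible selection rule. A common alternative in the CART literature is the one-standard-error rule: choose the simplest tree whose cross-validated error is within one standard error of the minimum. The sketch below is not part of the procedure used in this lab; it runs on synthetic data, and all names (X_toy, cv_toy, alpha_1se and so on) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy binary classification data.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, flip_y=0.1, random_state=0
)
cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Candidate alphas from the pruning path, without the root-only tree.
path_toy = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_toy, y_toy)
# Guard against tiny negative values caused by rounding; unique() also sorts increasingly.
alphas_toy = np.unique(np.clip(path_toy.ccp_alphas[:-1], 0.0, None))

cv_mean, cv_se = [], []
for a in alphas_toy:
    clf_toy = DecisionTreeClassifier(random_state=0, ccp_alpha=a)
    fold_errors = 1 - cross_val_score(clf_toy, X_toy, y_toy, cv=cv_toy)
    cv_mean.append(fold_errors.mean())
    cv_se.append(fold_errors.std(ddof=1) / np.sqrt(len(fold_errors)))
cv_mean, cv_se = np.array(cv_mean), np.array(cv_se)

# Rule 1: minimum cross-validated error.
i_min = int(cv_mean.argmin())
# Rule 2: largest alpha (simplest tree) whose error is within one SE of the minimum.
threshold = cv_mean[i_min] + cv_se[i_min]
i_1se = int(np.where(cv_mean <= threshold)[0].max())

alpha_min, alpha_1se = alphas_toy[i_min], alphas_toy[i_1se]
print("alpha (minimum CV error):", alpha_min)
print("alpha (one-SE rule):     ", alpha_1se)
```

By construction the one-SE rule never selects a smaller alpha than the minimum-error rule, so it yields a tree that is at most as large.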

4.3 Compare original and pruned trees

pred_test_pruned = pruned_class.predict(X_test)

conf_test_pruned = pd.DataFrame(
    confusion_matrix(y_test, pred_test_pruned, labels=labels),
    index=[f"Observed {x}" for x in labels],
    columns=[f"Predicted {x}" for x in labels]
)

conf_test_pruned

metrics_unpruned = classification_metrics(y_test, pred_test_class, positive="pos")
metrics_pruned = classification_metrics(y_test, pred_test_pruned, positive="pos")

comparison_class = pd.concat(
    [metrics_unpruned, metrics_pruned],
    keys=["unpruned", "pruned"]
)

comparison_class.round(3)

Questions

Has pruning reduced the size of the tree? Describe the main structural change.

Explain pruning as a form of regularization. What is being penalized?

Part 5. Regression trees using a quantitative response

We now illustrate the same ideas for a regression tree. Instead of predicting the binary response, we use glucose as a quantitative response.

This section is shorter because the aim is not to repeat the whole analysis, but to highlight what changes when the response is numerical.

5.1 Define a regression dataset

We remove the original categorical response diabetes from the predictors and use glucose as the response.

reg_data = mydataset_cc.drop(columns="diabetes").copy()

X_reg = reg_data.drop(columns="glucose")
y_reg = reg_data["glucose"]

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg,
    y_reg,
    train_size=prop_train,
    random_state=random_state
)

train_reg = X_train_reg.copy()
train_reg["glucose"] = y_train_reg

test_reg = X_test_reg.copy()
test_reg["glucose"] = y_test_reg

5.2 Fit a regression tree

tree_reg = DecisionTreeRegressor(
    random_state=random_state,
    ccp_alpha=0.0
)

tree_reg.fit(X_train_reg, y_train_reg)

print(export_text(tree_reg, feature_names=list(X_train_reg.columns), max_depth=4))

plt.figure(figsize=(18, 9))
plot_tree(
    tree_reg,
    feature_names=list(X_train_reg.columns),
    filled=True,
    rounded=True,
    proportion=True,
    max_depth=3,
    fontsize=9
)
plt.title("Initial regression tree, truncated to depth 3")
plt.show()

Questions

What is predicted in each terminal node of a regression tree: a class, a probability, or a numerical value?

Compare the interpretation of a terminal node in a classification tree and in a regression tree.

5.3 Prediction error for a regression tree

For regression problems, the usual error measures are based on the difference between observed and predicted numerical values.

pred_train_reg = tree_reg.predict(X_train_reg)
pred_test_reg = tree_reg.predict(X_test_reg)

mse_train = mean_squared_error(y_train_reg, pred_train_reg)
mse_test = mean_squared_error(y_test_reg, pred_test_reg)

rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)

print("Training MSE:", mse_train)
print("Test MSE:", mse_test)
print("Training RMSE:", rmse_train)
print("Test RMSE:", rmse_test)

plt.figure(figsize=(6, 5))
plt.scatter(pred_test_reg, y_test_reg)
plt.xlabel("Predicted glucose")
plt.ylabel("Observed glucose")
plt.title("Regression tree: observed vs predicted values")

lims = [min(pred_test_reg.min(), y_test_reg.min()), max(pred_test_reg.max(), y_test_reg.max())]
plt.plot(lims, lims, linestyle="--")
plt.show()
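For reference, the two measures are \(\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\) and \(\mathrm{RMSE} = \sqrt{\mathrm{MSE}}\). The following small worked example uses made-up values (y_obs_toy and y_hat_toy are hypothetical) to check the manual computation against mean_squared_error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values.
y_obs_toy = np.array([100.0, 120.0, 90.0, 150.0])
y_hat_toy = np.array([110.0, 115.0, 95.0, 130.0])

residuals_toy = y_obs_toy - y_hat_toy          # -10, 5, -5, 20
mse_manual = np.mean(residuals_toy ** 2)       # (100 + 25 + 25 + 400) / 4 = 137.5
rmse_manual = np.sqrt(mse_manual)              # expressed in the units of the response

mse_sklearn = mean_squared_error(y_obs_toy, y_hat_toy)

print("MSE (manual): ", mse_manual)
print("MSE (sklearn):", mse_sklearn)
print("RMSE:         ", rmse_manual)
```

RMSE is often easier to interpret than MSE because it is on the same scale as the response, here the units of glucose.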

Questions

Why do we use MSE or RMSE instead of accuracy for a regression tree?

5.4 Pruning the regression tree

path_reg = tree_reg.cost_complexity_pruning_path(X_train_reg, y_train_reg)
ccp_alphas_reg = path_reg.ccp_alphas[:-1]

cv_reg = KFold(n_splits=10, shuffle=True, random_state=random_state)

rows = []
for alpha in ccp_alphas_reg:
    reg = DecisionTreeRegressor(random_state=random_state, ccp_alpha=alpha)
    reg.fit(X_train_reg, y_train_reg)
    train_rmse = np.sqrt(mean_squared_error(y_train_reg, reg.predict(X_train_reg)))
    cv_rmse = -cross_val_score(
        reg,
        X_train_reg,
        y_train_reg,
        cv=cv_reg,
        scoring="neg_root_mean_squared_error"
    ).mean()
    rows.append({
        "ccp_alpha": alpha,
        "n_leaves": reg.get_n_leaves(),
        "depth": reg.get_depth(),
        "train_RMSE": train_rmse,
        "cv_RMSE": cv_rmse
    })

cp_table_reg = pd.DataFrame(rows)
cp_table_reg.head(15)

plt.figure(figsize=(7, 5))
plt.plot(cp_table_reg["ccp_alpha"], cp_table_reg["train_RMSE"], marker="o", label="Training RMSE")
plt.plot(cp_table_reg["ccp_alpha"], cp_table_reg["cv_RMSE"], marker="o", label="Cross-validated RMSE")
plt.xlabel("ccp_alpha")
plt.ylabel("RMSE")
plt.title("Regression tree pruning path")
plt.legend()
plt.show()

best_row_reg = cp_table_reg["cv_RMSE"].idxmin()
best_alpha_reg = cp_table_reg.loc[best_row_reg, "ccp_alpha"]

pruned_reg = DecisionTreeRegressor(
    random_state=random_state,
    ccp_alpha=best_alpha_reg
)
pruned_reg.fit(X_train_reg, y_train_reg)

plt.figure(figsize=(12, 6))
plot_tree(
    pruned_reg,
    feature_names=list(X_train_reg.columns),
    filled=True,
    rounded=True,
    proportion=True,
    fontsize=9
)
plt.title("Pruned regression tree")
plt.show()

pred_test_reg_pruned = pruned_reg.predict(X_test_reg)
rmse_test_pruned = np.sqrt(mean_squared_error(y_test_reg, pred_test_reg_pruned))

comparison_reg = pd.DataFrame({
    "model": ["unpruned", "pruned"],
    "test_RMSE": [rmse_test, rmse_test_pruned],
    "terminal_nodes": [tree_reg.get_n_leaves(), pruned_reg.get_n_leaves()]
})

comparison_reg.round(3)

Part 6. Final synthesis

Answer the following questions without running more code.

Questions

Summarize, in 5-6 lines, the full workflow followed in this lab: data preparation, train/test split, model fitting, prediction error estimation, pruning and final interpretation.

Optional extension for another dataset

If another dataset is used instead of Pima, the core workflow remains the same:

define the response variable;
identify whether the task is classification or regression;
split the data into training and test sets;
fit an initial tree;
interpret the main splits and terminal nodes;
estimate test error using an appropriate metric;
use cross-validation to choose the degree of pruning;
compare the original and pruned trees;
justify the final model in terms of prediction and interpretability.
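As an illustration of how the steps above transfer, here is a compact sketch of the same workflow on a different dataset. It assumes the breast cancer dataset bundled with scikit-learn (load_breast_cancer) as a stand-in; the variable names with the _bc suffix are hypothetical, and 5-fold cross-validation is used only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Response: tumour type, coded 0 = malignant, 1 = benign, so this is classification.
X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)

# Stratified train/test split.
X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(
    X_bc, y_bc, train_size=0.70, random_state=123, stratify=y_bc
)

# Initial, unpruned tree.
full_bc = DecisionTreeClassifier(random_state=123).fit(X_train_bc, y_train_bc)

# Candidate alphas from the pruning path, scored by cross-validation on the training set.
alphas_bc = full_bc.cost_complexity_pruning_path(X_train_bc, y_train_bc).ccp_alphas[:-1]
alphas_bc = np.unique(np.clip(alphas_bc, 0.0, None))  # guard against rounding noise
cv_bc = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

cv_error_bc = []
for a in alphas_bc:
    clf_bc = DecisionTreeClassifier(random_state=123, ccp_alpha=a)
    cv_error_bc.append(1 - cross_val_score(clf_bc, X_train_bc, y_train_bc, cv=cv_bc).mean())

best_alpha_bc = alphas_bc[int(np.argmin(cv_error_bc))]

# Refit with the selected alpha and compare both trees on the test set.
pruned_bc = DecisionTreeClassifier(random_state=123, ccp_alpha=best_alpha_bc)
pruned_bc.fit(X_train_bc, y_train_bc)

test_error_full_bc = 1 - full_bc.score(X_test_bc, y_test_bc)
test_error_pruned_bc = 1 - pruned_bc.score(X_test_bc, y_test_bc)

print("Leaves (unpruned, pruned):    ", full_bc.get_n_leaves(), pruned_bc.get_n_leaves())
print("Test error (unpruned, pruned):", test_error_full_bc, test_error_pruned_bc)
```

Only the data import and the definition of the response change; the splitting, fitting, pruning and comparison steps are the same as for the Pima data.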