Ensembles Lab 2: Boosting

Author

Adapted by EVL, FRC and ASP

Published

March 24, 2025

# Helper packages
library(dplyr)       # for data wrangling
library(ggplot2)     # for awesome plotting
library(modeldata)   # for the Ames housing data
library(foreach)     # for parallel processing with for loops

# Modeling packages
# library(tidymodels)
library(xgboost)
library(gbm)

Introduction

This lab continues the previous one, showing how to apply boosting. The same dataset as before will be used.

Ames Housing dataset

The AmesHousing package contains the data, together with instructions to create the required dataset.

We will, however, use the data from the modeldata package, where some preprocessing has already been performed (see: https://www.tmwr.org/ames).

The dataset has 74 variables, so a full descriptive analysis is not provided.

dim(ames)
[1] 2930   74
boxplot(ames)

We proceed as in the previous lab and divide the response variable by 1000 to facilitate reviewing the results.

require(dplyr)
ames <- ames %>% mutate(Sale_Price = Sale_Price/1000)
boxplot(ames)

Splitting the data into test/train

The data are split into separate training and test sets using stratified sampling, so that the split is balanced with respect to the response variable, Sale_Price.

# Stratified sampling with the rsample package
set.seed(123)
split <- rsample::initial_split(ames, prop = 0.7, 
                       strata = "Sale_Price")
ames_train  <- rsample::training(split)
ames_test   <- rsample::testing(split)

Fitting a boosted regression tree with xgboost

XGBoost Function Arguments Overview

The xgboost() function (and the lower-level xgb.train() used below) in the xgboost package trains gradient boosting models for regression and classification tasks. The main arguments include the following (a minimal call sketch is given after the list):

  • Parameters:
    • params: List of training parameters.
    • data: Training data.
    • nrounds: Number of boosting rounds.
    • watchlist: Validation set for early stopping.
    • obj: Custom objective function.
    • feval: Custom evaluation function.
    • verbose: Verbosity level.
    • print_every_n: Print frequency.
    • early_stopping_rounds: Rounds for early stopping.
    • maximize: Maximize evaluation metric.
    • save_period: Model save frequency.
    • save_name: Name for saved model.
    • xgb_model: Existing XGBoost model.
    • callbacks: List of callback functions.
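To see how these arguments fit together, here is a minimal, hedged call sketch on a toy built-in dataset (mtcars, used purely for illustration; the values are not tuned and this is not the lab's data):

set.seed(123)
toy <- xgb.DMatrix(
         data  = data.matrix(mtcars[, -1]),   # predictors as a numeric matrix
         label = mtcars$mpg                   # response
       )
toy.fit <- xgb.train(
             params    = list(max_depth = 3, eta = 0.1),  # training parameters
             data      = toy,                             # training data
             nrounds   = 50,                              # boosting rounds
             watchlist = list(train = toy),               # sets monitored each round
             verbose   = 0                                # suppress per-round printing
           )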

XGBoost Parameters Overview

Numerous parameters govern XGBoost’s behavior; a detailed description of all of them can be found in the XGBoost documentation. Key considerations include those controlling tree growth, the learning rate, and early stopping to prevent overfitting (a short sketch using these hyperparameters follows the list):

  • Parameters:
    • booster [default = gbtree]: Type of weak learner, trees (“gbtree”, “dart”) or linear models (“gblinear”).
    • eta [default=0.3, alias: learning_rate]: Reduces each tree’s contribution by multiplying its original influence by this value.
    • gamma [default=0, alias: min_split_loss]: Minimum cost reduction required for a split to occur.
    • max_depth [default=6]: Maximum depth trees can reach.
    • subsample [default=1]: Proportion of observations used for each tree’s training. If less than 1, applies Stochastic Gradient Boosting.
    • colsample_bytree [default=1]: Fraction of predictors (columns) sampled when building each tree.
    • nrounds: Number of boosting iterations, i.e., the number of models in the ensemble.
    • early_stopping_rounds: Number of consecutive iterations without improvement to trigger early stopping. If NULL, early stopping is disabled. Requires a separate validation set (watchlist) for early stopping.
    • watchlist: Validation set used for early stopping.
    • seed: Seed for result reproducibility. Note: use set.seed() instead.
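As an illustration of how these hyperparameters are passed through params, here is a hedged sketch with early stopping on a held-out validation set (again toy data, illustrative values only):

set.seed(123)
idx     <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
toy.tr  <- xgb.DMatrix(data.matrix(mtcars[idx, -1]),  label = mtcars$mpg[idx])
toy.val <- xgb.DMatrix(data.matrix(mtcars[-idx, -1]), label = mtcars$mpg[-idx])

toy.boost <- xgb.train(
  params = list(
    booster          = "gbtree",   # tree-based weak learners
    eta              = 0.05,       # learning rate
    gamma            = 0,          # minimum loss reduction to split
    max_depth        = 4,          # maximum tree depth
    subsample        = 0.8,        # stochastic gradient boosting
    colsample_bytree = 0.8         # fraction of predictors per tree
  ),
  data                  = toy.tr,
  nrounds               = 500,
  watchlist             = list(train = toy.tr, eval = toy.val),
  early_stopping_rounds = 20,      # stop after 20 rounds without improvement
  verbose               = 0
)
toy.boost$best_iteration           # iteration with the best validation score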

Test / Training in XGBoost

XGBoost Data Formats

XGBoost models can work with various data formats, including R matrices.

However, it’s advisable to use xgb.DMatrix, a specialized and optimized data structure within this library.

ames_train <- xgb.DMatrix(
                data  = ames_train %>% select(-Sale_Price) %>% data.matrix(),
                label = ames_train$Sale_Price
               )

ames_test <- xgb.DMatrix(
                data  = ames_test %>% select(-Sale_Price) %>% data.matrix(),
                label = ames_test$Sale_Price
               )

Fit the model

set.seed(123)
ames.boost <- xgb.train(
            data    = ames_train,
            params  = list(max_depth = 2),
            nrounds = 1000,
            eta= 0.05
          )
ames.boost
##### xgb.Booster
raw: 872.5 Kb 
call:
  xgb.train(params = list(max_depth = 2), data = ames_train, nrounds = 1000, 
    eta = 0.05)
params (as set within xgb.train):
  max_depth = "2", eta = "0.05", validate_parameters = "1"
xgb.attributes:
  niter
callbacks:
  cb.print.evaluation(period = print_every_n)
# of features: 73 
niter: 1000
nfeatures : 73 

Prediction and model assessment

ames.boost.trainpred <- predict(ames.boost,
                   newdata = ames_train
                 )

ames.boost.pred <- predict(ames.boost,
                   newdata = ames_test
                 )

train_rmseboost <- sqrt(mean((ames.boost.trainpred - getinfo(ames_train, "label"))^2))

test_rmseboost <- sqrt(mean((ames.boost.pred - getinfo(ames_test, "label"))^2))

paste("Error train (rmse) in XGBoost:", round(train_rmseboost,2))
[1] "Error train (rmse) in XGBoost: 14.31"
paste("Error test (rmse) in XGBoost:", round(test_rmseboost,2))
[1] "Error test (rmse) in XGBoost: 22.69"

Parameter optimization

The extra lab “labs/Lab-C.2.4b-BoostingOptimization.qmd” shows how to optimize some of the most important boosting parameters.
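As a pointer in that direction, here is a minimal sketch (not the extra lab’s code) of using xgb.cv() to choose nrounds by cross-validation on the training xgb.DMatrix built above:

set.seed(123)
ames.cv <- xgb.cv(
             params  = list(max_depth = 2, eta = 0.05),
             data    = ames_train,           # the xgb.DMatrix built earlier
             nrounds = 1000,
             nfold   = 5,                    # 5-fold cross-validation
             early_stopping_rounds = 25,     # stop when the CV error stalls
             verbose = 0
           )
ames.cv$best_iteration                       # candidate value for nrounds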

References