# Helper packages
library(dplyr) # for data wrangling
library(ggplot2) # for awesome plotting
library(modeldata)
library(foreach) # for parallel processing with for loops
# Modeling packages
# library(tidymodels)
library(xgboost)
library(gbm)
Ensembles Lab 2: Boosting
Introduction
This lab continues from the previous one, showing how to apply boosting. The same dataset as before will be used: the Ames Housing dataset. The package AmesHousing contains the data together with instructions to create the required dataset. We will use, however, the data from the modeldata package, where some preprocessing has already been performed (see: https://www.tmwr.org/ames). The dataset has 74 variables, so a descriptive analysis is not provided.
dim(ames)
[1] 2930 74
boxplot(ames)
We proceed as in the previous lab and divide the response variable by 1000 to facilitate reviewing the results.
require(dplyr)
ames <- ames %>% mutate(Sale_Price = Sale_Price / 1000)
boxplot(ames)
Splitting the data into test/train
The data are split into separate test/training sets in such a way that the sampling is stratified (balanced) with respect to the response variable, Sale_Price.
# Stratified sampling with the rsample package
set.seed(123)
split <- rsample::initial_split(ames, prop = 0.7,
                                strata = "Sale_Price")
ames_train <- rsample::training(split)
ames_test  <- rsample::testing(split)
Fitting a boosted regression tree with xgboost
XGBoost Parameters Overview
The xgboost() function in the XGBoost package trains Gradient Boosting models for regression and classification tasks. Key parameters and hyperparameters include the following (a minimal call sketch is given after the list):
- params: List of training parameters.
- data: Training data.
- nrounds: Number of boosting rounds.
- watchlist: Validation set for early stopping.
- obj: Custom objective function.
- feval: Custom evaluation function.
- verbose: Verbosity level.
- print_every_n: Print frequency.
- early_stopping_rounds: Rounds for early stopping.
- maximize: Maximize evaluation metric.
- save_period: Model save frequency.
- save_name: Name for saved model.
- xgb_model: Existing XGBoost model.
- callbacks: List of callback functions.
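To see how these arguments fit together, here is a minimal sketch of a call to xgboost(), assuming the classic interface of the R package (argument names may differ in more recent releases). The object name fit_sketch and the parameter values are illustrative only and are not reused later in the lab.
# Minimal call sketch (assumed classic xgboost() interface; illustrative values)
fit_sketch <- xgboost(
  data    = ames_train %>% select(-Sale_Price) %>% data.matrix(),  # predictors as a numeric matrix
  label   = ames_train$Sale_Price,                                 # response vector
  params  = list(max_depth = 2, eta = 0.05),                       # training parameters
  nrounds = 100,                                                   # number of boosting rounds
  verbose = 0                                                      # suppress per-round printing
)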
XGBoost Hyperparameters Overview
Numerous parameters govern XGBoost’s behavior. A detailed description of all parameters can be found in the XGBoost documentation. Key considerations include those controlling tree growth, the model learning rate, and early stopping to prevent overfitting (a short early-stopping sketch is given after the list):
- booster [default = gbtree]: Type of weak learner, trees (“gbtree”, “dart”) or linear models (“gblinear”).
- eta [default = 0.3, alias: learning_rate]: Reduces each tree’s contribution by multiplying its original influence by this value.
- gamma [default = 0, alias: min_split_loss]: Minimum cost reduction required for a split to occur.
- max_depth [default = 6]: Maximum depth trees can reach.
- subsample [default = 1]: Proportion of observations used for each tree’s training. If less than 1, applies Stochastic Gradient Boosting.
- colsample_bytree: Number of predictors considered at each split.
- nrounds: Number of boosting iterations, i.e., the number of models in the ensemble.
- early_stopping_rounds: Number of consecutive iterations without improvement to trigger early stopping. If NULL, early stopping is disabled. Requires a separate validation set (watchlist).
- watchlist: Validation set used for early stopping.
- seed: Seed for result reproducibility. Note: in the R interface, use set.seed() instead.
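To make the early-stopping mechanism concrete, the following is a minimal sketch, assuming the classic xgb.train() interface where the validation set is passed through watchlist (the argument name may differ in newer package versions). The object names and parameter values are illustrative, the model is not reused below, and in practice the validation set would be carved out of the training data rather than borrowed from the test set as done here for brevity.
# Early-stopping sketch (assumed classic xgb.train() interface; illustrative values)
dtrain_es <- xgb.DMatrix(ames_train %>% select(-Sale_Price) %>% data.matrix(),
                         label = ames_train$Sale_Price)
dvalid_es <- xgb.DMatrix(ames_test %>% select(-Sale_Price) %>% data.matrix(),
                         label = ames_test$Sale_Price)
fit_es <- xgb.train(
  params    = list(eta = 0.05, max_depth = 2, subsample = 0.8),
  data      = dtrain_es,
  nrounds   = 5000,                                # upper bound; early stopping usually halts sooner
  watchlist = list(train = dtrain_es, eval = dvalid_es),
  early_stopping_rounds = 50,                      # stop after 50 rounds without validation improvement
  verbose   = 0
)
fit_es$best_iteration                              # round with the best validation RMSE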
Test / Training in XGBoost
XGBoost Data Formats
XGBoost models can work with various data formats, including R matrices. However, it is advisable to use xgb.DMatrix, a specialized and optimized data structure within this library.
ames_train <- xgb.DMatrix(
  data = ames_train %>% select(-Sale_Price) %>% data.matrix(),
  label = ames_train$Sale_Price
)

ames_test <- xgb.DMatrix(
  data = ames_test %>% select(-Sale_Price) %>% data.matrix(),
  label = ames_test$Sale_Price
)
Fit the model
set.seed(123)
ames.boost <- xgb.train(
  data = ames_train,
  params = list(max_depth = 2),
  nrounds = 1000,
  eta = 0.05
)
ames.boost
##### xgb.Booster
raw: 872.5 Kb
call:
xgb.train(params = list(max_depth = 2), data = ames_train, nrounds = 1000,
eta = 0.05)
params (as set within xgb.train):
max_depth = "2", eta = "0.05", validate_parameters = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
# of features: 73
niter: 1000
nfeatures : 73
Prediction and model assessment
ames.boost.trainpred <- predict(ames.boost, newdata = ames_train)
ames.boost.pred <- predict(ames.boost, newdata = ames_test)

train_rmseboost <- sqrt(mean((ames.boost.trainpred - getinfo(ames_train, "label"))^2))
test_rmseboost <- sqrt(mean((ames.boost.pred - getinfo(ames_test, "label"))^2))
paste("Error train (rmse) in XGBoost:", round(train_rmseboost,2))
[1] "Error train (rmse) in XGBoost: 14.31"
paste("Error test (rmse) in XGBoost:", round(test_rmseboost,2))
[1] "Error test (rmse) in XGBoost: 22.69"
Parameter optimization
The extra lab “labs/Lab-C.2.4b-BoostingOptimization.qmd” shows how to optimize some of the most important boosting parameters.
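As a starting point, and not the procedure followed in the extra lab, the sketch below compares a small, illustrative grid of eta and max_depth values with cross-validation via xgb.cv(); the grid values and object names are assumptions for illustration.
# Cross-validated tuning sketch with xgb.cv() (illustrative grid and values)
grid <- expand.grid(eta = c(0.01, 0.05, 0.1), max_depth = c(2, 4, 6))
cv_rmse <- numeric(nrow(grid))
for (i in seq_len(nrow(grid))) {
  set.seed(123)
  cv <- xgb.cv(
    params  = list(eta = grid$eta[i], max_depth = grid$max_depth[i]),
    data    = ames_train,          # the xgb.DMatrix created above
    nrounds = 500,
    nfold   = 5,
    verbose = 0
  )
  # last row of the evaluation log holds the cross-validated RMSE after the final round
  cv_rmse[i] <- tail(cv$evaluation_log$test_rmse_mean, 1)
}
cbind(grid, cv_rmse)               # lowest cv_rmse indicates the preferred combination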