Introduction to Ensembles
Bagging and the Bootstrap
Random Forests
Boosting and its variants
Models that perform only slightly better than random guessing are called weak learners (Bishop 2007).
Weak learners' low predictive accuracy may be due to the predictor having high bias or high variance.
In many situations trees may be a good option (e.g. for simplicity and interpretability)
But there are issues that, if solved, may improve performance.
Note, too, that these problems are not unique to trees.
How can a model be made less variable without increasing its bias, or less biased without increasing its variance?
There are distinct approaches to deal with this problem.
Ensemble learning takes a distinct approach, based on “the wisdom of crowds”.
Ensemble predictors are built by repeatedly fitting models (weak learners) on the same data and combining them to form a single result.
As a general rule, ensemble learners tend to improve the results obtained with the weak learners they are made of.
Depending on how they deal with the bias-variance trade-off, we can consider distinct groups of ensemble methods:
Bagging
Boosting
Hybrid learners
Boosting and Stacking combine distinct predictors to yield a model with an increasingly smaller error, and so reduce the bias.
They differ in how the combination is done:
sequentially (boosting), or
using a meta-model (stacking).
Hybrid techniques combine approaches in order to deal with both variance and bias.
The approach should be clear from their names:
Gradient Boosted Trees with Bagging
Stacked bagging
Decision trees suffer from high variance when compared with other methods such as linear regression, especially when \(n/p\) is moderately large.
Given that this is intrinsic to trees, Breiman (1996) suggested building multiple trees from the same dataset and, somehow, averaging them.
Bagging relies, informally, on the idea that, given \(n\) independent observations \(Z_1, \ldots, Z_n\), each with variance \(\sigma^2\), their mean \(\bar Z\) has variance \(\sigma^2/n\).
That is, relying on the sample mean instead of on single observations decreases the variance by a factor of \(n\).
Note that this idea is still (approximately) valid for more general statistics to which the CLT applies.
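As a quick illustrative check of this variance-reduction idea (not part of the course example; all values are arbitrary), the following sketch simulates means of \(n\) observations and compares their empirical variance with \(\sigma^2/n\):

# Illustrative sketch: the variance of a mean of n observations is about sigma^2 / n
set.seed(123)
sigma2 <- 4        # variance of a single observation
n      <- 25
means  <- replicate(10000, mean(rnorm(n, mean = 0, sd = sqrt(sigma2))))
var(means)         # empirical variance of the sample means
sigma2 / n         # theoretical value: 0.16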
Two questions arise here: how to obtain multiple training sets from a single dataset (by taking bootstrap resamples), and how to combine the resulting predictors (by averaging for regression, or by majority vote for classification).
\[\hat f_{bag}(x)=\frac 1B \sum_{b=1}^B \hat f^{*b}(x) \]
\[ \hat G_{bag}(x) = \arg \max_k \hat f_{k,bag}(x), \]
where \(\hat f_{k,bag}(x)\) is the proportion of the \(B\) bootstrapped trees that predict class \(k\) at \(x\).
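The following hand-rolled sketch (not the course example) illustrates \(\hat f_{bag}\) for regression, fitting rpart trees on bootstrap resamples of the built-in mtcars data; all object names are illustrative:

# Minimal hand-rolled bagging of regression trees (illustration only)
library(rpart)

set.seed(123)
B <- 100
n <- nrow(mtcars)

# fit one tree per bootstrap resample of the data
boot_trees <- lapply(seq_len(B), function(b) {
  idx <- sample(n, replace = TRUE)
  rpart(mpg ~ ., data = mtcars[idx, ])
})

# hat f_bag(x): average the B individual tree predictions
pred_matrix <- sapply(boot_trees, predict, newdata = mtcars)
f_bag <- rowMeans(pred_matrix)
head(f_bag)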
Since each out-of-bag set is not used to train the model, it can be used to evaluate performance.
Source: https://www.baeldung.com/cs/random-forests-out-of-bag-error
The example uses the AmesHousing dataset on house prices in Ames, IA, to predict the sale price (Sale_Price) of the houses.
We start by splitting the dataset into train/test subsets.
Next we build a bagged tree model on the train subset and evaluate it on the test subset.
Last, we try to interpret the results using “variable importance”.
Bagging is equivalent to a random forest in which all predictors are considered at each split (mtry = number of predictors), so the randomForest library can be used (see the sketch below).
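A minimal sketch of these steps, assuming the AmesHousing, rsample and randomForest packages; the split proportion, ntree value and object names are illustrative choices, not the course's exact settings:

library(AmesHousing)
library(rsample)
library(randomForest)

ames <- make_ames()

# split the dataset into train/test subsets
set.seed(123)
ames_split <- initial_split(ames, prop = 0.7)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# bagging = random forest considering ALL predictors at each split (mtry = p)
p <- ncol(ames_train) - 1
ames_bag <- randomForest(
  Sale_Price ~ .,
  data  = ames_train,
  mtry  = p,
  ntree = 100
)

# evaluate on the test subset
pred <- predict(ames_bag, newdata = ames_test)
sqrt(mean((ames_test$Sale_Price - pred)^2))   # test RMSE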
To measure feature importance, the reduction in the loss function (e.g., SSE) attributed to each variable at each split is recorded.
The total reduction in the loss function across all splits on a variable is summed up and used as that variable's feature importance.
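Continuing the sketch above (the illustrative ames_bag object is assumed), the impurity-based importance scores can be extracted and plotted with randomForest's built-in helpers:

# total SSE reduction attributed to each variable, summed over all splits and trees
imp <- randomForest::importance(ames_bag)
head(imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE], 10)

# quick variable-importance plot of the top 10 predictors
randomForest::varImpPlot(ames_bag, n.var = 10)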
While growing each decision tree during the bagging process, random forests perform split-variable randomization: at each split, only a random subset of the predictors (typically \(m \approx p/3\) for regression and \(m \approx \sqrt{p}\) for classification) is considered as split candidates.
Source: AIML.com research
Random Forests Algorithm, from chapter 11 in (Boehmke and Greenwell 2020)
Random Forests Algorithm, from chapter 17 in (Hastie and Efron 2016).
library(ranger)

# number of features
n_features <- length(setdiff(names(ames_train), "Sale_Price"))

# train a default random forest model
ames_rf1 <- ranger(
  Sale_Price ~ .,
  data = ames_train,
  mtry = floor(n_features / 3),
  respect.unordered.factors = "order",
  seed = 123
)

# get OOB RMSE
(default_rmse <- sqrt(ames_rf1$prediction.error))
## [1] 24859.27
There are several parameters that, appropriately tuned, can improve RF performance.
The first two, the number of trees and mtry, usually have the largest impact on predictive accuracy.
Grid Search: systematically searches through all possible combinations in a predefined grid of hyperparameter values to find the combination that maximizes performance (see the sketch after this list).
Random Search: randomly samples hyperparameter values from predefined distributions; faster than grid search, but less likely to find the optimum.
Model-Based Optimization leverages probabilistic models, often Gaussian processes, to model the objective function and iteratively guide the search for optimal hyperparameters.
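A minimal grid-search sketch with ranger, using the OOB RMSE as the evaluation metric; the grid values below are illustrative, and ames_train / n_features are assumed from the earlier code:

library(ranger)

# illustrative grid over two influential hyperparameters
hyper_grid <- expand.grid(
  mtry          = floor(n_features * c(0.1, 0.25, 0.333, 0.4)),
  min.node.size = c(1, 5, 10)
)

hyper_grid$rmse <- NA
for (i in seq_len(nrow(hyper_grid))) {
  fit <- ranger(
    Sale_Price ~ .,
    data          = ames_train,
    num.trees     = 500,
    mtry          = hyper_grid$mtry[i],
    min.node.size = hyper_grid$min.node.size[i],
    respect.unordered.factors = "order",
    seed          = 123
  )
  hyper_grid$rmse[i] <- sqrt(fit$prediction.error)   # OOB RMSE
}

head(hyper_grid[order(hyper_grid$rmse), ])   # best combinations first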
The idea of improving weak learners by aggregation has moved historically along two distinct lines:
Idea: create a model that is better than any of its individual components by combining their strengths and compensating for their weaknesses.
For this, multiple weak models are trained sequentially, and each new model is trained to correct the errors made by the previous model.
The final model is a weighted combination of all the models, and the weights are determined by the accuracy of each model.
Boosting, like other ensemble methods, improves the accuracy of weak learners and achieves better predictive performance than the individual models.
Boosting also reduces overfitting by improving the generalization ability of models.
Available in many flavors.
Can be parallelized.
Extensive track record in real-world applications and industry.
Can be computationally expensive, especially when dealing with large datasets and complex models.
Can be sensitive to noisy data and outliers.
May not work well with certain types of data distributions.
Not as good “out of the box”: requires careful tuning of hyperparameters to achieve optimal performance, which can be time-consuming and challenging.
AdaBoost (Adaptive Boosting) directly reflects what has been described above as “boosting”.
Its main objective is to improve classification accuracy by combining multiple weak learners.
In order to run AdaBoost it is required to have a base (weak) learner that can be fitted to weighted versions of the training data (see the sketch below).
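The sketch below implements the classical AdaBoost.M1 reweighting scheme by hand, with rpart stumps as the weak learner on simulated data; it is an illustration of the idea, not the course's implementation, and all names and settings are illustrative:

# Hand-rolled AdaBoost.M1 with decision stumps (illustration only)
library(rpart)

set.seed(123)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- ifelse(x1 + x2 + rnorm(n, sd = 0.2) > 1, 1, -1)   # labels in {-1, +1}
dat <- data.frame(x1, x2, y = factor(y))

M      <- 50
w      <- rep(1 / n, n)          # observation weights
stumps <- vector("list", M)
alpha  <- numeric(M)

for (m in 1:M) {
  # weak learner: a stump fitted to the weighted data
  stumps[[m]] <- rpart(y ~ x1 + x2, data = dat, weights = w,
                       control = rpart.control(maxdepth = 1))
  pred <- as.numeric(as.character(predict(stumps[[m]], dat, type = "class")))
  err  <- sum(w * (pred != y)) / sum(w)                 # weighted error
  err  <- min(max(err, 1e-10), 1 - 1e-10)               # numerical guard
  alpha[m] <- log((1 - err) / err)                      # learner weight
  w <- w * exp(alpha[m] * (pred != y))                  # up-weight misclassified cases
  w <- w / sum(w)
}

# final classifier: sign of the weighted vote of all stumps
scores <- Reduce(`+`, lapply(1:M, function(m)
  alpha[m] * as.numeric(as.character(predict(stumps[[m]], dat, type = "class")))))
mean(sign(scores) == y)   # training accuracy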
Gradient Boosting is a generalization of the AdaBoost algorithm that allows the use of any cost function, as long as it is differentiable.
The flexibility of this algorithm has made it possible to apply boosting to a multitude of problems (regression, multiple classification, etc.), making it one of the most successful machine learning methods.
While there are various adaptations, the general idea behind all of them is the same: train models sequentially, with each model adjusting the residuals (errors) of the previous models.
Train a first weak learner \(f_1\), which predicts the response variable \(y\), and calculate the residuals \(y - f_1(x)\).
Then, train a new model \(f_2\), which tries to predict the residuals of the previous model, in other words, which tries to correct the errors made by model \(f_1\).
\(f_1(x) \approx y\)
\(f_2(x) \approx y - f_1(x)\)
In the next iteration, calculate the residuals of the two models together \(y - f_1(x) - f_2(x)\), the errors made by \(f_1\) that \(f_2\) has not been able to correct, and train a third model \(f_3\) to try to correct them.
\(f_3(x) \approx y - f_1(x) - f_2(x)\)
This process is repeated \(M\) times, so that each new model minimizes the residuals (errors) of the previous one.
Since the goal of Gradient Boosting is to minimize the residuals iteration by iteration, it is susceptible to overfitting.
One way to avoid this problem is by using a regularization value, also known as the learning rate (\(\lambda\)), which limits the influence of each model on the ensemble.
As a result of this regularization, more models are needed to form the ensemble, but better results are achieved.
\(f_1(x) \approx y\)
\(f_2(x) \approx y - \lambda f_1(x)\)
\(f_3(x) \approx y - \lambda f_1(x) - \lambda f_2(x)\)
\(y \approx \lambda f_1(x) + \lambda f_2(x) + \lambda f_3(x) + \ldots + \lambda f_m(x)\)
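The following hand-rolled sketch of this procedure fits shallow rpart trees to the current residuals and adds them with a learning rate \(\lambda\); the initialization at the mean, the toy data and all settings are illustrative choices, not the course's implementation:

# Hand-rolled gradient boosting for regression (squared-error loss), illustration only
library(rpart)

set.seed(123)
n <- 300
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

M      <- 200     # number of boosting iterations
lambda <- 0.1     # learning rate (shrinkage)

f_hat  <- rep(mean(y), n)     # initial prediction (a common choice)
models <- vector("list", M)

for (m in 1:M) {
  r <- y - f_hat                                   # current residuals
  models[[m]] <- rpart(r ~ x, data = data.frame(x = x, r = r),
                       control = rpart.control(maxdepth = 2))
  # add a shrunken version of the new model to the ensemble
  f_hat <- f_hat + lambda * predict(models[[m]], data.frame(x = x))
}

sqrt(mean((y - f_hat)^2))   # training RMSE after M iterations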
There are multiple extensions of Gradient Boosting:
XGBoost
LightGBM
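As a brief illustration of one such extension, the same toy data can be fitted with the xgboost package (assuming a recent version of xgboost; parameter values are illustrative, and x and y are reused from the previous sketch):

library(xgboost)

# reuse the toy regression data (x, y) from the previous sketch
dtrain <- xgb.DMatrix(data = matrix(x, ncol = 1), label = y)

fit_xgb <- xgb.train(
  params = list(
    objective = "reg:squarederror",
    eta       = 0.1,     # learning rate
    max_depth = 2
  ),
  data    = dtrain,
  nrounds = 200
)

pred <- predict(fit_xgb, dtrain)
sqrt(mean((y - pred)^2))   # training RMSE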
Fraud Detection
Image and Speech Recognition
Anomaly Detection
Medical Diagnosis
Amazon’s recommendation engine
Models that predict protein structures from amino acid sequences
Pattern identification in fMRI brain scans.