From AdaBoost to Gradient Boosting

As we explored in the previous session, boosting refers to a class of ensemble methods that build predictive models sequentially, where each new model focuses on the errors made by the previous ones. Historically, boosting first became widely used for classification in the form of AdaBoost. The same core idea extends to regression and leads naturally to gradient boosting, which is the main focus of this session.

This notebook therefore serves two purposes:

Use AdaBoost for regression as an intuitive starting point
Transition to gradient boosting for regression as the principled and modern formulation

Intuition: Weak regressors can form strong predictors

Suppose we have a regression problem with inputs \(x_i\) and continuous targets \(y_i\).

A weak regressor is a model that performs only slightly better than a very naive baseline, such as predicting the mean of the target variable. On its own, such a model is not very useful. However, many weak regressors combined carefully can yield a strong predictive model.

The basic boosting idea is:

Fit a simple model to the data
Identify where this model performs poorly
Encourage the next model to focus on these difficult observations
Combine all models into a single predictor

AdaBoost for regression

AdaBoost implements this idea by assigning weights to observations. Observations that are predicted poorly receive larger weights and therefore influence the next model more strongly.

We observe training data

\[\{(x_i, y_i)\}_{i=1}^N\]

and maintain observation weights

\[w_i^{(m)}\]

which change after each boosting iteration \(m\). Initially, all observations are weighted equally:

\[w_i^{(1)} = \frac{1}{N}\]

Choice of weak learner

In regression settings, AdaBoost typically relies on very simple base learners, such as

decision stumps (trees with depth one)
very shallow regression trees

The goal is not to fit the data well in a single step, but to make small, incremental improvements.
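To make the idea of a weighted weak learner concrete, here is a minimal sketch of a decision stump on a single feature that respects observation weights. The helper names `fit_stump` and `predict_stump` are hypothetical, the toy data are made up, and the sketch assumes positive weights and at least two distinct values of `x`.

```python
import numpy as np

def fit_stump(x, y, w):
    """Fit a depth-one regression tree on one feature using sample weights.

    Returns (threshold, left_value, right_value). Predictions are left_value
    where x <= threshold and right_value otherwise.
    """
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    xs = np.unique(x)
    best = None
    # Candidate thresholds: midpoints between consecutive distinct x values
    for t in (xs[:-1] + xs[1:]) / 2:
        left = x <= t
        # The weighted mean minimises the weighted squared error on each side
        lv = np.average(y[left], weights=w[left])
        rv = np.average(y[~left], weights=w[~left])
        sse = np.sum(w * (y - np.where(left, lv, rv)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return np.where(np.asarray(x, dtype=float) <= t, lv, rv)

# Toy data (made up for illustration) with uniform initial weights 1/N
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.0, 3.0, 3.0])
w = np.full(len(x), 1 / len(x))

stump = fit_stump(x, y, w)
print(stump)                    # threshold 2.5, left value 1.0, right value 3.0
print(predict_stump(stump, x))
```

Increasing the weight of an observation pulls the split and the leaf values towards fitting that observation, which is the mechanism boosting relies on.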

Measuring regression error

For regression, misclassification counts are no longer meaningful. Instead, prediction errors are measured using absolute deviations:

\[e_i = |y_i - \hat{y}_i|\]

To make errors comparable across observations, they are normalised:

\[L_i = \frac{e_i}{\max_j e_j}\]

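so that \(0 \le L_i \le 1\), with the worst-predicted observation receiving \(L_i = 1\). The larger an observation's normalised error, the more its weight grows in the next boosting round.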
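As a small worked example of the two formulas above, the following sketch computes the absolute errors and their normalised versions for four made-up observations. The guard against a perfect fit (all errors zero) is an added assumption, since the normalisation is undefined in that case.

```python
import numpy as np

# Made-up targets and predictions for illustration
y = np.array([3.0, 5.0, 2.0, 8.0])
y_hat = np.array([2.5, 5.0, 4.0, 4.0])

e = np.abs(y - y_hat)  # absolute errors: [0.5, 0.0, 2.0, 4.0]

# Normalise by the largest error; fall back to zeros if the fit is perfect
L = e / e.max() if e.max() > 0 else np.zeros_like(e)

print(L)  # [0.125, 0.0, 0.5, 1.0]
```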

The AdaBoost.R2 algorithm

For boosting rounds \(m = 1, \dots, M\):

Fit a weak regressor \(T_m(x)\) using the current weights \(w_i^{(m)}\)
Compute the normalised errors \(L_i\) from the predictions \(T_m(x_i)\)
Compute the weighted average loss

\[\text{err}_m = \sum_{i=1}^N w_i^{(m)} L_i\]

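Compute the model weight, which is positive only if \(\text{err}_m < 1/2\) (the original algorithm stops boosting otherwise)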

\[\alpha_m = \log\left(\frac{1 - \text{err}_m}{\text{err}_m}\right)\]

Update observation weights

\[w_i^{(m+1)} = w_i^{(m)} \cdot \exp(\alpha_m L_i)\]

Renormalise the weights so that they sum to one

Observations with large errors gain influence over subsequent regressors.
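The error, model-weight, and reweighting steps of a single boosting round can be sketched as follows. This is an illustration rather than a reference implementation: the function name `r2_round` is hypothetical, the data are made up, and the sketch assumes \(0 < \text{err}_m < 1/2\) so that the logarithm is defined and positive.

```python
import numpy as np

def r2_round(y, y_pred, w):
    """One round of the updates above, given the weak regressor's predictions."""
    e = np.abs(y - y_pred)
    if e.max() == 0:
        raise ValueError("perfect fit: normalised errors are undefined")
    L = e / e.max()                  # normalised errors in [0, 1]
    err = np.sum(w * L)              # weighted average loss
    if not 0 < err < 0.5:
        raise ValueError("this sketch assumes 0 < err < 0.5")
    alpha = np.log((1 - err) / err)  # model weight
    w_new = w * np.exp(alpha * L)    # up-weight poorly predicted observations
    w_new /= w_new.sum()             # renormalise so the weights sum to one
    return err, alpha, w_new

# Made-up example: only the last observation is predicted badly
y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 2.0])
w = np.full(4, 0.25)

err, alpha, w_new = r2_round(y, y_pred, w)
print(err)    # 0.25
print(alpha)  # log(3), about 1.0986
print(w_new)  # [1/6, 1/6, 1/6, 1/2]
```

After one round the badly predicted observation carries half of the total weight, so the next weak regressor is pushed to fit it.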

Final prediction

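The final AdaBoost.R2 model combines the weak regressors using the weights \(\alpha_m\). Because each \(T_m(x)\) already predicts on the scale of \(y\), a plain weighted sum \(\sum_m \alpha_m T_m(x)\) would rescale the predictions. AdaBoost.R2 instead takes the weighted median of the individual predictions:

\[ \hat{f}(x) = \inf\left\{ y : \sum_{m:\, T_m(x) \le y} \alpha_m \ge \frac{1}{2} \sum_{m=1}^M \alpha_m \right\} \]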

Each regressor contributes according to its predictive performance.
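As a hedged illustration of how the weights \(\alpha_m\) enter the combination, the sketch below computes a weighted median (the rule used by AdaBoost.R2 implementations such as scikit-learn's `AdaBoostRegressor`) and, for comparison, a normalised weighted average. The helper `weighted_median` is hypothetical, and the three predictions and weights are made-up numbers.

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    order = np.argsort(values)
    v, a = values[order], weights[order]
    cum = np.cumsum(a)
    idx = np.searchsorted(cum, 0.5 * cum[-1])  # first index with cum >= half
    return v[idx]

# Predictions T_m(x) of three weak regressors at one point x, with weights alpha_m
preds = np.array([2.0, 2.5, 10.0])
alphas = np.array([1.0, 1.5, 0.5])

print(weighted_median(preds, alphas))      # 2.5
print(np.average(preds, weights=alphas))   # 10.75 / 3, about 3.58
```

The weighted median ignores the magnitude of the stray prediction 10.0, whereas the weighted average is pulled towards it.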

Limitations

While AdaBoost captures the idea of sequential error correction, it has several limitations in regression problems:

the definition and normalisation of errors are somewhat ad hoc
the method is sensitive to outliers
there is no explicit loss function being minimised
the optimisation perspective remains unclear

These limitations motivate a more principled approach to boosting.