From AdaBoost to Gradient Boosting#

As we explored in the previous session, boosting refers to a class of ensemble methods that build predictive models sequentially, where each new model focuses on the errors made by previous ones. Historically, boosting was first introduced for classification in the form of AdaBoost. The same core idea can be extended to regression and leads naturally to gradient boosting, which is the main focus of this session.

This notebook therefore serves two purposes:

  1. Use AdaBoost for regression as an intuitive starting point

  2. Transition to gradient boosting for regression as the principled and modern formulation


Intuition: Weak regressors can form strong predictors#

Suppose we have a regression problem with inputs \(x_i\) and continuous targets \(y_i\).

A weak regressor is a model that performs only slightly better than a very naive baseline, such as predicting the mean of the target variable. On its own, such a model is not very useful. However, many weak regressors combined carefully can yield a strong predictive model.

The basic boosting idea is:

  1. Fit a simple model to the data

  2. Identify where this model performs poorly

  3. Encourage the next model to focus on these difficult observations

  4. Combine all models into a single predictor


AdaBoost for regression#

AdaBoost implements this idea by assigning weights to observations. Observations that are predicted poorly receive larger weights and therefore influence the next model more strongly.

We observe training data

\[\{(x_i, y_i)\}_{i=1}^N\]

and maintain observation weights

\[w_i^{(m)}\]

which change after each boosting iteration \(m\). Initially, all observations are weighted equally:

\[w_i^{(1)} = \frac{1}{N}\]

Choice of weak learner

In regression settings, AdaBoost typically relies on very simple base learners, such as

  • decision stumps (trees with depth one)

  • very shallow regression trees

The goal is not to fit the data well in a single step, but to make small, incremental improvements.

Measuring regression error

For regression, misclassification counts are no longer meaningful. Instead, prediction errors are measured using absolute deviations:

\[e_i = |y_i - \hat{y}_i|\]

To make errors comparable across observations, they are normalised:

\[L_i = \frac{e_i}{\max_j e_j}\]

so that \(0 \le L_i \le 1\). Observations with larger errors receive stronger penalties.

The AdaBoost.R2 algorithm#

For boosting rounds \(m = 1, \dots, M\):

  1. Fit a weak regressor \(T_m(x)\) using the current weights \(w_i^{(m)}\)

  2. Compute normalised errors \(L_i\)

  3. Compute the weighted error rate

\[\text{err}_m = \sum_{i=1}^N w_i^{(m)} L_i\]
  1. Compute the model weight

\[\alpha_m = \log\left(\frac{1 - \text{err}_m}{\text{err}_m}\right)\]
  1. Update observation weights

\[w_i^{(m+1)} = w_i^{(m)} \cdot \exp(\alpha_m L_i)\]
  1. Renormalise the weights so that they sum to one

Observations with large errors gain influence over subsequent regressors.

Final prediction

The final AdaBoost regression model is a weighted sum of weak regressors:

\[ \hat{f}(x) = \sum_{m=1}^M \alpha_m T_m(x) \]

Each regressor contributes according to its predictive performance.

Limitations#

While AdaBoost captures the idea of sequential error correction, it has several limitations in regression problems:

  • the definition and normalisation of errors is somewhat ad hoc

  • the method is sensitive to outliers

  • there is no explicit loss function being minimised

  • the optimisation perspective remains unclear

These limitations motivate a more principled approach to boosting.


From AdaBoost to gradient boosting#

Gradient boosting reformulates boosting as an explicit optimisation problem. Instead of reweighting observations, it constructs models that directly minimise a chosen loss function.

The key idea is to view boosting as gradient descent in function space.

Core idea#

  1. Start with a simple initial model, usually the mean of the target values

  2. Compute residuals between observed targets and current predictions

  3. Fit a small regression tree to these residuals

  4. Add this tree to the model, scaled by a learning rate

  5. Repeat

Each new tree explains structure that previous trees failed to capture.

After \(M\) iterations, the model can be written as

\[\hat{F}_M(x) = \sum_{m=1}^M \eta \, T_m(x)\]

where \(\eta\) denotes the learning rate.

Gradient boosting offers several advantages. It:

  • works naturally for regression problems

  • allows arbitrary differentiable loss functions

  • is less sensitive to outliers than AdaBoost

  • forms the basis of modern methods such as XGBoost and LightGBM

For these reasons, gradient boosting has largely replaced AdaBoost in regression settings.

Example#

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = np.linspace(0, 6, 200).reshape(-1,1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

gb = GradientBoostingRegressor(
      n_estimators=200,
      learning_rate=0.1,
      max_depth=2,
      random_state=0)
gb.fit(X, y)

X_plot = np.linspace(0, 6, 500).reshape(-1,1)
y_pred = gb.predict(X_plot)

fig, ax = plt.subplots()
ax.scatter(X, y, s=20, alpha=0.5)
ax.plot(X_plot, y_pred, color="black", linewidth=2)
ax.set(title="Gradient boosting regression", xlabel="x", ylabel="y");
../../_images/eb08462b00689f06143be9b1f23a2409dcfedeb686b158bb89f8a0a56cd9835a.png

Summary

  • AdaBoost introduces sequential error correction via observation weighting

  • This idea can be extended to regression but has limitations

  • Gradient boosting reformulates boosting as loss-function optimisation

  • Gradient boosting is the dominant practical approach for regression