Model Selection#

In neurocognitive psychology, we often collect huge amounts of data — sometimes thousands of measurements from different brain regions or many behavioral scores. Having more information can, in theory, help us make better predictions, but it also brings risks: too many variables can make analyses infeasible, lead to false discoveries, or cause models to learn noise instead of real effects.

The large p issue#

Big data refers to large data sets with many predictors which often cannot be processed or analyzed using traditional data processing techniques. For our prediction models, this brings some issues:

  • While the linear model can in theory still be used for such data, the ordinary least squares fit becomes unreliable as p grows and is infeasible when p > n (there is no longer a unique solution)

  • The large number of features reduces interpretability

This is where linear model selection becomes essential, offering techniques to refine our models and extract meaningful insights from high-dimensional data.
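To make this concrete, here is a minimal simulated illustration (hypothetical data, not the Hitters set we use below): when p > n, ordinary least squares can reproduce the training data perfectly even though the predictors are pure noise, which is exactly the kind of overfitting we need to guard against.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 30 observations but 50 pure-noise predictors (p > n)
n, p = 30, 50
X_noise = rng.normal(size=(n, p))
y_noise = rng.normal(size=n)  # outcome is unrelated to the predictors

ols = LinearRegression().fit(X_noise, y_noise)
print(f"Training R²: {ols.score(X_noise, y_noise):.2f}")  # ≈ 1.00, a perfect "fit" to noise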

Today’s data: Hitters#

For practical demonstration, we will use the Hitters dataset. This dataset provides Major League Baseball data from the 1986 and 1987 seasons. It contains 322 observations of major league players on 20 variables (so it’s not big data, but we can pretend it is). The research aim is to predict a baseball player’s salary on the basis of various predictors associated with performance in the previous year. You can check its contents here: https://islp.readthedocs.io/en/latest/datasets/Hitters.html

import statsmodels.api as sm 

# Get the data
hitters = sm.datasets.get_rdataset("Hitters", "ISLR").data

For computational reasons, we will not include all predictors but only a smaller subset:

# Keep a total of 10 variables - the target 'Salary' and 9 features.
hitters_subset = hitters[["Salary", "CHits", "CAtBat", "CRuns", "CWalks", "Assists", "Hits", "HmRun", "Years", "Errors"]].copy()

# Remove rows with missing values
hitters_subset.dropna(inplace=True)

hitters_subset.head()
Salary CHits CAtBat CRuns CWalks Assists Hits HmRun Years Errors
rownames
-Alan Ashby 475.0 835 3449 321 375 43 81 7 14 10
-Alvin Davis 480.0 457 1624 224 263 82 130 18 3 14
-Andre Dawson 500.0 1575 5628 828 354 11 141 20 11 3
-Andres Galarraga 91.5 101 396 48 33 40 87 10 2 4
-Alfredo Griffin 750.0 1133 4408 501 194 421 169 4 11 25

Let’s also take a look at the correlation matrix to check for potential multicollinearity, which can affect the stability of linear regression models.

import seaborn as sns

sns.heatmap(hitters_subset.corr(), annot=True, cmap="coolwarm", fmt=".2f");
[Figure: correlation heatmap of the selected predictors]

The heatmap reveals strong correlations between several predictors:

  • CHits and CAtBat show a correlation of 1.00,

  • CHits and CRuns have a very strong correlation of 0.98.
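Instead of reading these values off the heatmap, you can also list highly correlated pairs programmatically. A small sketch (the 0.9 threshold is an arbitrary choice):

import numpy as np

# Show all predictor pairs whose absolute correlation exceeds 0.9
corr = hitters_subset.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
print(upper.stack().sort_values(ascending=False).loc[lambda s: s > 0.9])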

We thus remove two of the correlated features:

features_drop = ["CRuns", "CAtBat"]
hitters_subset = hitters_subset.drop(columns=features_drop)

Handling big data in linear models#


To handle large datasets efficiently in linear modeling, we will introduce three methods:

  • Subset Selection

  • Dimensionality Reduction

  • Regularization / Shrinkage

Subset Selection#

In subset selection we identify a subset of the p predictors that we believe are truly related to the outcome. The model is then fitted using least squares on this reduced set of variables.

How do we determine which variables are relevant?!

Best Subset Selection#

We will start by performing Best Subset Selection (also called exhaustive search) as implemented in the mlxtend package. It has great documentation, e.g. for the exhaustive search. In short, this approach is a brute-force evaluation of feature subsets. A specific performance metric (e.g. MSE, R², or accuracy) is optimized given an arbitrary regressor or classifier. For example, if we have 4 features, the algorithm will evaluate all 15 possible (non-empty) combinations of features.
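To get a feeling for why this brute-force approach does not scale, recall that p features give 2^p - 1 non-empty subsets; a quick calculation makes the growth explicit (with the 7 features we kept above, that is 127 candidate models):

# Number of candidate models evaluated by best subset selection: 2^p - 1
for p in [4, 7, 10, 20, 30]:
    print(f"p = {p:2d}: {2**p - 1:,} candidate subsets")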

Before we start the subset selection, we first define the target and the features:

import numpy as np

y = hitters_subset["Salary"]
X = hitters_subset.drop("Salary", axis=1)

We first split our data into training and test sets. Although the selection function uses cross-validation to identify the best subset of predictors, this evaluation happens during the selection process and can still overfit the data. To fairly assess how the final model performs on new data, we split off a test set and use it only after feature selection is complete.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

| Purpose | What is it for? | When? |
|---|---|---|
| Cross-validation in the selection function | Helps choose the best subset of features | During selection |
| Test set evaluation | Checks how well the final model performs | After selection |

We then run Best Subset Selection on the training data:

from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import ExhaustiveFeatureSelector

efs = ExhaustiveFeatureSelector(
        estimator=LinearRegression(),
        min_features=1,
        max_features=X_train.shape[1],
        scoring='r2',
        cv=5,
        print_progress=False)

efs.fit(X_train, y_train)

print('Best R²: %.2f' % efs.best_score_)
print('Best subset (indices):', efs.best_idx_)
print('Best subset (corresponding names):', efs.best_feature_names_)
Best R²: 0.41
Best subset (indices): (1, 3, 4)
Best subset (corresponding names): ('CWalks', 'Hits', 'HmRun')

If you are interested in the details, they are stored in the metric dictionary:

import pandas as pd

df = pd.DataFrame.from_dict(efs.get_metric_dict()).T
df.sort_values('avg_score', inplace=True, ascending=False)
df
feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err
47 (1, 3, 4) [0.38067374180608526, 0.463839312906159, 0.505... 0.412539 (CWalks, Hits, HmRun) 0.100218 0.077973 0.038987
90 (1, 3, 4, 6) [0.3878486031208028, 0.4634739461791568, 0.493... 0.408445 (CWalks, Hits, HmRun, Errors) 0.094497 0.073522 0.036761
83 (1, 2, 3, 4) [0.3813017586747449, 0.4658246874129488, 0.506... 0.404694 (CWalks, Assists, Hits, HmRun) 0.098463 0.076607 0.038304
89 (1, 3, 4, 5) [0.33058568684405754, 0.46099452778892, 0.5147... 0.403253 (CWalks, Hits, HmRun, Years) 0.105549 0.082121 0.041061
114 (1, 2, 3, 4, 6) [0.3751200780670817, 0.4651398463048052, 0.496... 0.401198 (CWalks, Assists, Hits, HmRun, Errors) 0.095867 0.074588 0.037294
... ... ... ... ... ... ... ...
58 (2, 5, 6) [0.14956857373046695, 0.13751233647954653, 0.1... 0.122419 (Assists, Years, Errors) 0.120381 0.093661 0.04683
5 (5,) [0.14935846878458914, 0.1343083510706763, 0.12... 0.122124 (Years,) 0.135551 0.105464 0.052732
2 (2,) [-0.0025851222923383155, -0.007245558244625583... -0.040331 (Assists,) 0.04707 0.036622 0.018311
6 (6,) [-0.02226936534097579, -0.006949819455392525, ... -0.051379 (Errors,) 0.045878 0.035695 0.017847
21 (2, 6) [-0.04083051774641855, -0.007183353236134726, ... -0.057133 (Assists, Errors) 0.042413 0.032998 0.016499

127 rows × 7 columns
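If you would like a visual summary of these results, one option (a sketch based on the metric dictionary stored in df above) is to plot the best cross-validated R² achieved at each subset size, to see where adding features stops paying off:

import matplotlib.pyplot as plt

# Best average CV R² for each subset size
df["n_features"] = df["feature_idx"].apply(len)
best_per_size = df.groupby("n_features")["avg_score"].max().astype(float)

best_per_size.plot(marker="o")
plt.xlabel("Number of features")
plt.ylabel("Best CV R²")
plt.title("Best subset selection: score by subset size");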

Forward Stepwise Selection#

Forward Stepwise Selection is a greedy feature-selection method that starts with no features and, at each step, adds the single feature whose inclusion most improves the model’s performance. This “add the best remaining feature” step is repeated until a desired number of features is reached, at which point the algorithm stops and returns the best subset found.
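Before using the mlxtend implementation, here is a minimal hand-written sketch of the greedy logic (the forward_step helper and the 5-fold cross-validated R² criterion are illustrative choices, not part of mlxtend):

from sklearn.model_selection import cross_val_score

def forward_step(X, y, selected):
    """Return the best remaining feature to add, judged by 5-fold CV R²."""
    scores = {}
    for feat in X.columns.difference(selected):
        candidate = list(selected) + [feat]
        scores[feat] = cross_val_score(LinearRegression(), X[candidate], y,
                                       scoring="r2", cv=5).mean()
    best = max(scores, key=scores.get)
    return best, scores[best]

# Greedily add three features, one at a time
selected = []
for _ in range(3):
    best, score = forward_step(X_train, y_train, selected)
    selected.append(best)
    print(f"added {best:8s} -> CV R² = {score:.3f}")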

An important parameter is k_features, which determines the number of features to select. We can pass an integer (must be less than the total number of available features), a tuple (the algorithm will then evaluate all subset sizes between the min and max value), or one of two string options ("best" or "parsimonious"). Please refer to the documentation for further details.

from mlxtend.feature_selection import SequentialFeatureSelector

sfs_forward = SequentialFeatureSelector(
    estimator=LinearRegression(),
    k_features=(1, X_train.shape[1]),
    forward=True,
    scoring='r2',
    cv=5,
    verbose=0)
    
sfs_forward.fit(X_train, y_train)

print(f">> Forward SFS:")
print(f"   Best CV R²      : {sfs_forward.k_score_:.3f}")
print(f"   Optimal # feats : {len(sfs_forward.k_feature_idx_)}")
print(f"   Feature indices : {sfs_forward.k_feature_idx_}")
print(f"   Feature names   : {sfs_forward.k_feature_names_}")
>> Forward SFS:
   Best CV R²      : 0.413
   Optimal # feats : 3
   Feature indices : (1, 3, 4)
   Feature names   : ('CWalks', 'Hits', 'HmRun')

You can see we ended up with the same three predictors as in best subset selection: CWalks, Hits, HmRun. However, this is not always the case: best subset and stepwise selection can, and often do, lead to different results. In our case we only had a small number of predictors, which makes it more likely to end up with the same subset.

Backward Stepwise Selection#

Backward Stepwise Selection begins with the full feature set and, at each step, removes the single feature whose exclusion most improves (or least harms) model performance. We keep repeating this “remove the worst feature” step until only the desired number of features remains, and then the algorithm returns that reduced subset:

sfs_backward = SequentialFeatureSelector(
    estimator=LinearRegression(),
    k_features=(1, X_train.shape[1]),
    forward=False,
    floating=False,
    scoring='r2',
    cv=5,
    verbose=0)

sfs_backward.fit(X_train, y_train)

print(f"<< Backward SFS:")
print(f"   Best CV R²      : {sfs_backward.k_score_:.3f}")
print(f"   Optimal # feats : {len(sfs_backward.k_feature_idx_)}")
print(f"   Feature indices : {sfs_backward.k_feature_idx_}")
print(f"   Feature names   : {sfs_backward.k_feature_names_}")
<< Backward SFS:
   Best CV R²      : 0.413
   Optimal # feats : 3
   Feature indices : (1, 3, 4)
   Feature names   : ('CWalks', 'Hits', 'HmRun')

Subset Selection Summary#

| Best Subset Selection | Forward Stepwise Selection | Backward Stepwise Selection |
|---|---|---|
| + will find the best model | - not guaranteed to find best model | - not guaranteed to find best model |
| - may overfit with large p | + possible to use when p is very large | + possible to use when p is very large, given p < n |
| - computationally very expensive | + computationally less demanding | + computationally less demanding |
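The computational claim in the last row can be quantified: exhaustive search fits 2^p - 1 models, while forward (or backward) stepwise selection fits only on the order of p(p+1)/2 models, one per candidate feature at each of at most p steps. A rough comparison:

# Rough count of models fitted: exhaustive vs. stepwise selection
for p in [7, 20, 50]:
    print(f"p = {p:2d}: exhaustive = {2**p - 1:,}   stepwise ≈ {p * (p + 1) // 2:,}")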

What next?#

Once we have identified the features that are relevant for predicting the outcome, let's evaluate the model performance and estimate the true test error with the three selected predictors (Best Subset Selection and both stepwise procedures all chose CWalks, Hits, and HmRun).

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression

selected_features = list(sfs_backward.k_feature_names_)

# Subset the data
X_train_subset = X_train[selected_features]
X_test_subset = X_test[selected_features]

# Fit the model
model = LinearRegression()
model.fit(X_train_subset, y_train)

# Get predictions and performance
y_pred = model.predict(X_test_subset)

mse_test = mean_squared_error(y_test, y_pred)
rmse_test = np.sqrt(mse_test)
r2_test = r2_score(y_test, y_pred)

print(f"Test MSE:  {mse_test:.2f}")
print(f"Test RMSE: {rmse_test:.2f}")
print(f"Test R²:   {r2_test:.4f}")
Test MSE:  182321.00
Test RMSE: 426.99
Test R²:   0.2492

So in sum:

  • On average, our predictions deviate from the actual salary by about 427 thousand dollars (salaries in this dataset are recorded in thousands of dollars).

  • Our model explains ~25% of the variance in salary.
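As an optional sanity check (not part of the original analysis), you can compare this to a naive baseline that always predicts the mean training salary; its test R² will be at or below zero, which puts the ~25% explained variance into perspective.

from sklearn.dummy import DummyRegressor

# Naive baseline: always predict the mean salary from the training set
baseline = DummyRegressor(strategy="mean").fit(X_train_subset, y_train)
print(f"Baseline test R²: {r2_score(y_test, baseline.predict(X_test_subset)):.4f}")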

Regularization and Dimensionality Reduction#

As mentioned before, regularization and dimensionality reduction are two other approaches to dealing with large numbers of predictors. Regularization techniques will be introduced in the next session, and dimensionality reduction will be introduced in the Principal Component Analysis session.