6.3 Contrast Coding
Contrast coding lets us define and test custom comparisons between group means, which makes it more flexible than dummy or effects coding. By specifying contrasts, we can address targeted research questions, such as whether one set of genotypes differs from another, that other coding schemes do not address directly.
We will explore contrast coding by defining five distinct contrasts and applying them to our dataset. Each contrast corresponds to a separate research question, showcasing how this approach provides tailored insights into group differences.
Creating the Contrast Matrix
Load and prepare the data
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Load the data, convert genotype to a categorical variable, and get the levels
df = pd.read_csv("data/alzheimers_data.txt", delimiter='\t').dropna()
df['genotype'] = df['genotype'].astype('category')
levels = df['genotype'].cat.categories.tolist()
print(levels)
['e2/e2', 'e2/e3', 'e2/e4', 'e3/e3', 'e3/e4', 'e4/e4']
Define the contrasts of interest and create the matrix
# levels: 'e2/e2', 'e2/e3', 'e2/e4', 'e3/e3', 'e3/e4', 'e4/e4'
contrast1 = np.array([-0.5,-0.5,0,0,0.5,0.5])
contrast2 = np.array([0,0,-0.5,-0.5,0.5,0.5])
contrast3 = np.array([-0.5,0,0,0.5,0,0])
contrast4 = np.array([-0.5,-0.5,-0.5,0.5,0.5,0.5])
contrast5 = np.array([0.5,0,0,0,0,-0.5])
contrast_matrix = np.column_stack([contrast1, contrast2, contrast3, contrast4, contrast5])
Contrast 1: Compare the mean WMf of e2/e2 and e2/e3 against the mean of e3/e4 and e4/e4.
Contrast 2: Compare the mean WMf of e2/e4 and e3/e3 against the mean of e3/e4 and e4/e4.
Contrast 3: Compare the mean WMf of e2/e2 against the mean of e3/e3.
Contrast 4: Compare the mean WMf of e2/e2, e2/e3, and e2/e4 against the mean of e3/e3, e3/e4, and e4/e4.
Contrast 5: Compare the mean WMf of e2/e2 against the mean of e4/e4.
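Before fitting anything, it can help to check a few basic properties of the contrast weights. The following is a minimal sketch using only NumPy: it rebuilds the matrix from the weights listed above and checks whether each contrast sums to zero, whether the contrasts are mutually orthogonal, and whether they are linearly independent. The variable names `column_sums`, `gram`, `off_diagonal`, and `rank` are illustrative and not part of the original code.

```python
import numpy as np

# Contrast weights copied from the section above; level order:
# 'e2/e2', 'e2/e3', 'e2/e4', 'e3/e3', 'e3/e4', 'e4/e4'
contrast_matrix = np.column_stack([
    [-0.5, -0.5, 0, 0, 0.5, 0.5],        # contrast 1
    [0, 0, -0.5, -0.5, 0.5, 0.5],        # contrast 2
    [-0.5, 0, 0, 0.5, 0, 0],             # contrast 3
    [-0.5, -0.5, -0.5, 0.5, 0.5, 0.5],   # contrast 4
    [0.5, 0, 0, 0, 0, -0.5],             # contrast 5
])

# Each column should sum to zero for it to be a contrast
column_sums = contrast_matrix.sum(axis=0)

# Cross-products between contrasts; nonzero off-diagonal entries
# mean the corresponding pair of contrasts is not orthogonal
gram = contrast_matrix.T @ contrast_matrix
off_diagonal = gram - np.diag(np.diag(gram))

# Full column rank means the five contrasts are linearly independent
rank = np.linalg.matrix_rank(contrast_matrix)

print("column sums:", column_sums)
print("cross-products:\n", gram)
print("rank:", rank)
```

With these weights, every column sums to zero and the rank is 5, but several cross-products are nonzero (for example, contrasts 1 and 2 share the e3/e4 and e4/e4 groups), so the contrasts overlap rather than being orthogonal.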
Creating the Regression Model
# Create mapping for the contrast matrix
genotype_mapping = {"e2/e2": 0, "e2/e3": 1, "e2/e4": 2, "e3/e3": 3, "e3/e4": 4, "e4/e4": 5}
# Create the design matrix and outcome variable
X = np.array([contrast_matrix[genotype_mapping[genotype]] for genotype in df['genotype']])
X = sm.add_constant(X)
y = df['WMf']
# Fit the model
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                    WMf   R-squared:                       0.052
Model:                            OLS   Adj. R-squared:                  0.032
Method:                 Least Squares   F-statistic:                     2.605
Date:                Wed, 09 Apr 2025   Prob (F-statistic):             0.0257
Time:                        08:53:03   Log-Likelihood:                 219.06
No. Observations:                 245   AIC:                            -426.1
Df Residuals:                     239   BIC:                            -405.1
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.8176      0.015     55.866      0.000       0.789       0.846
x1             0.0606      0.072      0.839      0.402      -0.082       0.203
x2            -0.0956      0.064     -1.494      0.136      -0.222       0.030
x3            -0.0325      0.162     -0.201      0.841      -0.351       0.286
x4            -0.0282      0.088     -0.321      0.748      -0.201       0.145
x5            -0.0650      0.070     -0.928      0.354      -0.203       0.073
==============================================================================
Omnibus:                        9.927   Durbin-Watson:                   1.820
Prob(Omnibus):                  0.007   Jarque-Bera (JB):               10.312
Skew:                          -0.502   Prob(JB):                      0.00577
Kurtosis:                       3.026   Cond. No.                         35.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
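When reading coefficients from a model like this, it is worth checking what each coefficient actually estimates under the chosen coding. The sketch below uses made-up genotype means (hypothetical values, not taken from the dataset). Since the group means satisfy `mu = [1, C] @ beta`, the coefficients implied by a coding matrix `C` can be recovered by solving that linear system. The sketch compares them with the contrast values `C.T @ mu`, and also tries an alternative coding built from the generalized inverse of the contrast weights, which is one common way to make coefficients line up with the intended comparisons. This is an illustration under those assumptions, not a description of how the model above was built.

```python
import numpy as np

# Contrast weights from this section (rows = genotype levels, columns = contrasts)
C = np.column_stack([
    [-0.5, -0.5, 0, 0, 0.5, 0.5],
    [0, 0, -0.5, -0.5, 0.5, 0.5],
    [-0.5, 0, 0, 0.5, 0, 0],
    [-0.5, -0.5, -0.5, 0.5, 0.5, 0.5],
    [0.5, 0, 0, 0, 0, -0.5],
])

# Hypothetical genotype means, made up purely for illustration
mu = np.array([0.80, 0.85, 0.78, 0.82, 0.84, 0.79])

# Coefficients implied by using the weights directly as predictor values
design = np.column_stack([np.ones(6), C])
beta_direct = np.linalg.solve(design, mu)

# Value of each contrast when applied to the group means
contrast_values = C.T @ mu

# Alternative coding: generalized inverse of the 5 x 6 matrix of contrast weights
coding = np.linalg.pinv(C.T)
design_inverted = np.column_stack([np.ones(6), coding])
beta_inverted = np.linalg.solve(design_inverted, mu)

print("intercept (direct coding):  ", beta_direct[0])
print("unweighted mean of means:   ", mu.mean())
print("slopes (direct coding):     ", beta_direct[1:])
print("slopes (inverted coding):   ", beta_inverted[1:])
print("contrast values C.T @ mu:   ", contrast_values)
```

In this example the intercept equals the unweighted mean of the group means under both codings, because every contrast column sums to zero. The slopes match the contrast values only under the inverted coding; with the weights entered directly, the slopes are different linear combinations of the group means.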
Interpreting the Outputs
R-squared:
Only 5.2% of the variance in WMf is explained by the contrasts, so genotype accounts for little of the variation in WMf. The adjusted R-squared, which penalizes the number of predictors, is even lower (0.032).
F-statistic:
The overall model is statistically significant at the 0.05 level (F = 2.605, p = 0.0257), indicating that the contrasts collectively explain some of the variance in WMf.
Intercept (const):
Because every contrast column sums to zero, the intercept represents the unweighted mean of the six genotype means of WMf.
Value: 0.8176 (highly significant, p < 0.001).
Contrast 1 (x1):
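Intended to compare the mean WMf of e2/e2 + e2/e3 (Group 1, weights of -0.5) with that of e3/e4 + e4/e4 (Group 2, weights of +0.5).
Coefficient: 0.0606. Under this sign convention, a positive value points toward a higher mean WMf in Group 2 rather than Group 1.
p = 0.402: the coefficient is not statistically significant.
Because the five contrasts are not mutually orthogonal and the weights are entered directly as predictor values, each coefficient is adjusted for the others and need not equal the simple difference between the two group means.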
All other contrasts are interpreted similarly.
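The adjusted R-squared and F-statistic discussed above can be recomputed from the rounded values in the summary table. This is a small arithmetic sketch using the standard OLS formulas. Because it starts from the rounded R-squared of 0.052, the F-statistic it produces differs slightly from the 2.605 that statsmodels reports from unrounded values.

```python
# Values read off the regression summary
r_squared = 0.052
n_obs = 245
n_predictors = 5
df_resid = n_obs - n_predictors - 1  # 239, matching "Df Residuals"

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Overall F-statistic: explained variance per predictor
# relative to unexplained variance per residual degree of freedom
f_statistic = (r_squared / n_predictors) / ((1 - r_squared) / df_resid)

print(f"adjusted R-squared: {adj_r_squared:.3f}")
print(f"F-statistic (from rounded R-squared): {f_statistic:.2f}")
```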
Discussion:
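The low R-squared shows that the model captures little of the variance in WMf. None of the individual contrast coefficients is statistically significant, so the data provide no clear evidence that the hypothesized comparisons are associated with differences in WMf, even though the overall F-test is significant.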
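A single comparison can also be estimated directly from the genotype means, which gives a useful cross-check on regression output. The sketch below does this for Contrast 1 using simulated data: the group sizes, means, and noise level are all hypothetical and stand in for the real dataset. The estimate is the weighted sum of group means, and its standard error uses the pooled within-group variance from a one-way model.

```python
import numpy as np

rng = np.random.default_rng(0)
levels = ['e2/e2', 'e2/e3', 'e2/e4', 'e3/e3', 'e3/e4', 'e4/e4']

# Simulated WMf values, 40 per genotype (hypothetical, not the real data)
true_means = [0.80, 0.85, 0.78, 0.82, 0.84, 0.79]
groups = {lvl: rng.normal(m, 0.1, size=40) for lvl, m in zip(levels, true_means)}

# Contrast 1 weights: e2/e2 + e2/e3 versus e3/e4 + e4/e4
weights = np.array([-0.5, -0.5, 0, 0, 0.5, 0.5])

means = np.array([groups[lvl].mean() for lvl in levels])
sizes = np.array([groups[lvl].size for lvl in levels])

# Pooled within-group variance (mean squared error of a one-way model)
ss_within = sum(((groups[lvl] - groups[lvl].mean()) ** 2).sum() for lvl in levels)
df_resid = sizes.sum() - len(levels)
mse = ss_within / df_resid

# Contrast estimate, its standard error, and the t statistic
estimate = weights @ means
std_err = np.sqrt(mse * np.sum(weights ** 2 / sizes))
t_value = estimate / std_err

print(f"estimate = {estimate:.4f}, SE = {std_err:.4f}, "
      f"t = {t_value:.3f}, df = {df_resid}")
```

Here a positive estimate means the e3/e4 and e4/e4 groups have the higher average, which follows from the signs of the weights.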
Summary
Contrast coding is a flexible tool for testing targeted hypotheses.