6.3 Contrast Coding


Contrast coding lets us define and test custom comparisons, which makes it more flexible than schemes such as dummy or effects coding. By specifying contrasts, we can address targeted research questions, for example whether one set of groups differs from another. This is particularly useful for hypotheses that other coding schemes do not address directly.

We will explore contrast coding by defining five distinct contrasts and applying them to our dataset. Each contrast corresponds to a separate research question, showcasing how this approach provides tailored insights into group differences.

Creating the Contrast Matrix

Load and prepare the data

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Load the data, convert genotype to a categorical variable, and get the levels
df = pd.read_csv("data/alzheimers_data.txt", delimiter='\t').dropna()
df['genotype'] = df['genotype'].astype('category')
levels = df['genotype'].cat.categories.tolist()
print(levels)

['e2/e2', 'e2/e3', 'e2/e4', 'e3/e3', 'e3/e4', 'e4/e4']

Define the contrasts of interest and create the matrix

# levels:            'e2/e2', 'e2/e3', 'e2/e4', 'e3/e3', 'e3/e4', 'e4/e4'
contrast1 = np.array([-0.5,-0.5,0,0,0.5,0.5])
contrast2 = np.array([0,0,-0.5,-0.5,0.5,0.5])
contrast3 = np.array([-0.5,0,0,0.5,0,0])
contrast4 = np.array([-0.5,-0.5,-0.5,0.5,0.5,0.5])
contrast5 = np.array([0.5,0,0,0,0,-0.5])

contrast_matrix = np.column_stack([contrast1, contrast2, contrast3, contrast4, contrast5])

Contrast 1: Compare the mean WMf of e2/e2 and e2/e3 (coded -0.5) against the mean of e3/e4 and e4/e4 (coded +0.5)
Contrast 2: Compare the mean WMf of e2/e4 and e3/e3 (coded -0.5) against the mean of e3/e4 and e4/e4 (coded +0.5)
Contrast 3: Compare the mean WMf of e2/e2 (coded -0.5) against the mean of e3/e3 (coded +0.5)
Contrast 4: Compare the mean WMf of e2/e2, e2/e3, and e2/e4 (coded -0.5) against the mean of e3/e3, e3/e4, and e4/e4 (coded +0.5)
Contrast 5: Compare the mean WMf of e2/e2 (coded +0.5) against the mean of e4/e4 (coded -0.5); the sign is reversed relative to contrasts 1 to 4
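
Before fitting a model, it can help to check two properties of the contrast matrix: whether each column sums to zero, and whether the columns are mutually orthogonal. The following is a minimal sketch using only NumPy. The weights are copied from the definitions above; the names `column_sums`, `gram`, and `is_orthogonal` are illustrative and do not appear elsewhere in this section.

```python
import numpy as np

# Contrast weights copied from the definitions above. Rows follow the level order
# 'e2/e2', 'e2/e3', 'e2/e4', 'e3/e3', 'e3/e4', 'e4/e4'.
contrast_matrix = np.column_stack([
    [-0.5, -0.5, 0, 0, 0.5, 0.5],
    [0, 0, -0.5, -0.5, 0.5, 0.5],
    [-0.5, 0, 0, 0.5, 0, 0],
    [-0.5, -0.5, -0.5, 0.5, 0.5, 0.5],
    [0.5, 0, 0, 0, 0, -0.5],
])

# A contrast compares groups, so the weights in each column should sum to zero.
column_sums = contrast_matrix.sum(axis=0)
print("column sums:", column_sums)

# Off-diagonal entries of C'C are the dot products between pairs of contrasts.
# They are all zero only if the contrasts are mutually orthogonal.
gram = contrast_matrix.T @ contrast_matrix
off_diagonal = gram - np.diag(np.diag(gram))
is_orthogonal = np.allclose(off_diagonal, 0)
print("pairwise dot products:\n", gram)
print("orthogonal:", is_orthogonal)
```

For these five contrasts every column sums to zero, but the set is not orthogonal: contrast 1 and contrast 2, for example, share the weights on e3/e4 and e4/e4. Non-orthogonal contrasts overlap in the information they use, which matters when the regression coefficients are interpreted.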

Creating the Regression Model

# Map each genotype to its row in contrast_matrix (same order as `levels`)
genotype_mapping = {"e2/e2": 0, "e2/e3": 1, "e2/e4": 2, "e3/e3": 3, "e3/e4": 4, "e4/e4": 5}

# Create the design matrix and outcome variable
X = np.array([contrast_matrix[genotype_mapping[genotype]] for genotype in df['genotype']])
X = sm.add_constant(X)
y = df['WMf']

# Fit the model
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    WMf   R-squared:                       0.052
Model:                            OLS   Adj. R-squared:                  0.032
Method:                 Least Squares   F-statistic:                     2.605
Date:                Wed, 09 Apr 2025   Prob (F-statistic):             0.0257
Time:                        08:53:03   Log-Likelihood:                 219.06
No. Observations:                 245   AIC:                            -426.1
Df Residuals:                     239   BIC:                            -405.1
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.8176      0.015     55.866      0.000       0.789       0.846
x1             0.0606      0.072      0.839      0.402      -0.082       0.203
x2            -0.0956      0.064     -1.494      0.136      -0.222       0.030
x3            -0.0325      0.162     -0.201      0.841      -0.351       0.286
x4            -0.0282      0.088     -0.321      0.748      -0.201       0.145
x5            -0.0650      0.070     -0.928      0.354      -0.203       0.073
==============================================================================
Omnibus:                        9.927   Durbin-Watson:                   1.820
Prob(Omnibus):                  0.007   Jarque-Bera (JB):               10.312
Skew:                          -0.502   Prob(JB):                      0.00577
Kurtosis:                       3.026   Cond. No.                         35.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
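
The model has an intercept and five contrast columns for six genotypes, so it has as many parameters as there are groups. A small simulation can illustrate what that implies. The sketch below uses made-up group sizes and means (not the Alzheimer's data) and plain NumPy least squares rather than statsmodels. Under those assumptions it checks that the fitted values reproduce the six group means and that the intercept equals their unweighted average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same contrast weights as in this section (rows follow the genotype level order).
contrast_matrix = np.column_stack([
    [-0.5, -0.5, 0, 0, 0.5, 0.5],
    [0, 0, -0.5, -0.5, 0.5, 0.5],
    [-0.5, 0, 0, 0.5, 0, 0],
    [-0.5, -0.5, -0.5, 0.5, 0.5, 0.5],
    [0.5, 0, 0, 0, 0, -0.5],
])

# Hypothetical, deliberately unbalanced data: sizes and means are invented.
group_sizes = [5, 40, 10, 120, 60, 10]
true_means = [0.90, 0.85, 0.80, 0.82, 0.78, 0.75]
codes = np.repeat(np.arange(6), group_sizes)
y = rng.normal(loc=np.take(true_means, codes), scale=0.1)

# One row of contrast weights per observation, plus a column of ones (intercept).
X = np.column_stack([np.ones(len(codes)), contrast_matrix[codes]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Compare the model's prediction for each group with the observed group mean.
group_means = np.array([y[codes == g].mean() for g in range(6)])
fitted_means = np.column_stack([np.ones(6), contrast_matrix]) @ beta

print("intercept:                     ", beta[0])
print("unweighted mean of group means:", group_means.mean())
print("overall sample mean of y:      ", y.mean())
```

Because the groups are unbalanced, the intercept generally differs from the overall sample mean of `y`, which weights each genotype by its number of observations.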

Interpreting the Outputs

R-squared:
- Only 5.2% of the variance in WMf is explained by the contrasts, so genotype accounts for little of the variation in WMf. The adjusted R-squared, which accounts for the number of predictors, is lower still at 3.2%.
F-statistic:
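- The overall model is statistically significant at the 5% level (F=2.605, p=0.0257), indicating that the contrasts collectively explain some of the variation in WMf. Because the intercept and five contrasts describe six genotypes, this is the same test as asking whether all six genotype means are equal.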
Intercept (const):
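- Represents the unweighted mean of the six genotype means of WMf, because every contrast column sums to zero. With unequal group sizes this can differ from the overall sample mean.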
- Value: 0.8176 (highly significant, p<0.001).
Contrast 1 (x1):
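- Intended to compare the mean WMf of e2/e2 + e2/e3 (Group 1, coded -0.5) with e3/e4 + e4/e4 (Group 2, coded +0.5).
- Coefficient: 0.0606. Under this coding a positive coefficient would point toward a higher mean in Group 2, not Group 1. Because the five contrasts are not orthogonal and their weights are used directly as predictor values, the coefficient is not simply the difference between the two group means and should be read with caution.
- p=0.402: The coefficient is not statistically significant.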
All other contrasts are interpreted similarly.
Discussion:
- The low R-squared shows the model does not capture much variance in WMf.
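- None of the five contrast coefficients is individually significant, so the model provides no evidence for the hypothesized comparisons, even though the overall F-test suggests that the genotype means are not all equal.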
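
One way to see what each coefficient estimates is to invert the 6 x 6 coding matrix (intercept plus contrasts): for a model with one parameter per group, the coefficients equal that inverse times the vector of group means. The sketch below does this with NumPy. It also builds an alternative coding, obtained by inverting the intended comparison weights, under which each coefficient equals its contrast applied to the group means. That alternative is an assumption shown for illustration and is not the procedure used in this section. The values in `example_means` are made up.

```python
import numpy as np

# Same contrast weights as in this section (rows follow the genotype level order).
contrast_matrix = np.column_stack([
    [-0.5, -0.5, 0, 0, 0.5, 0.5],
    [0, 0, -0.5, -0.5, 0.5, 0.5],
    [-0.5, 0, 0, 0.5, 0, 0],
    [-0.5, -0.5, -0.5, 0.5, 0.5, 0.5],
    [0.5, 0, 0, 0, 0, -0.5],
])

# Coding matrix as used in the regression: intercept column plus contrast columns.
coding = np.column_stack([np.ones(6), contrast_matrix])

# Row k of the inverse holds the weights coefficient k places on the group means.
implied_weights = np.linalg.inv(coding)
for name, row in zip(["const", "x1", "x2", "x3", "x4", "x5"], implied_weights):
    print(f"{name:>5}:", np.round(row, 3))

# Alternative (illustrative assumption): treat the intended comparisons as rows of
# a hypothesis matrix and invert it to obtain the predictor codes.
hypotheses = np.vstack([np.full(6, 1 / 6), contrast_matrix.T])
alt_coding = np.linalg.inv(hypotheses)

# Made-up group means, used only to compare the two codings.
example_means = np.array([0.90, 0.85, 0.80, 0.82, 0.78, 0.75])
coef_original = implied_weights @ example_means
coef_alternative = np.linalg.inv(alt_coding) @ example_means
intended = contrast_matrix.T @ example_means

print("original coding coefficients:   ", np.round(coef_original[1:], 4))
print("alternative coding coefficients:", np.round(coef_alternative[1:], 4))
print("intended contrast values:       ", np.round(intended, 4))
```

With the original coding, the row for x1 puts positive weight on e2/e4 and e3/e4 and negative weight on the other four genotypes, and the row for x5 compares e3/e4 with e4/e4, so neither coefficient corresponds to its intended comparison. Under the alternative coding the coefficients match the intended contrast values exactly.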

Summary

Contrast coding is a flexible tool for testing targeted hypotheses.