6.2 Effects Coding#

Unweighted Effects Coding#

For dummy coding, we used the e4/e4 genotype as the reference category. The intercept represented the mean of the reference category, and the coefficients for other groups represented the difference in means between each group and the reference.

The difference in unweighted effects coding is that the reference is the overall mean of all groups rather than a specific group. The intercept then represents the grand mean (the unweighted average of the group means of the dependent variable, WMf), while the coefficient for each group represents the deviation of that group's mean from the grand mean.

We can implement unweighted effects coding just like dummy coding, using Sum instead of Treatment as the contrast.

import numpy as np
import pandas as pd
from patsy.contrasts import Sum
import statsmodels.formula.api as smf

# Load the dataset
df = pd.read_csv("data/alzheimers_data.txt", delimiter='\t').dropna()

# Convert genotype into a categorical variable
df['genotype'] = df['genotype'].astype('category')

# Create and fit the model
model = smf.ols('WMf ~ C(genotype, Sum)', data=df)
results = model.fit()

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    WMf   R-squared:                       0.052
Model:                            OLS   Adj. R-squared:                  0.032
Method:                 Least Squares   F-statistic:                     2.605
Date:                Wed, 09 Apr 2025   Prob (F-statistic):             0.0257
Time:                        08:53:01   Log-Likelihood:                 219.06
No. Observations:                 245   AIC:                            -426.1
Df Residuals:                     239   BIC:                            -405.1
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept                     0.8176      0.015     55.866      0.000       0.789       0.846
C(genotype, Sum)[S.e2/e2]    -0.0325      0.060     -0.544      0.587      -0.150       0.085
C(genotype, Sum)[S.e2/e3]    -0.0162      0.020     -0.809      0.419      -0.056       0.023
C(genotype, Sum)[S.e2/e4]     0.0619      0.031      2.001      0.047       0.001       0.123
C(genotype, Sum)[S.e3/e3]     0.0175      0.016      1.083      0.280      -0.014       0.049
C(genotype, Sum)[S.e3/e4]    -0.0316      0.019     -1.660      0.098      -0.069       0.006
==============================================================================
Omnibus:                        9.927   Durbin-Watson:                   1.820
Prob(Omnibus):                  0.007   Jarque-Bera (JB):               10.312
Skew:                          -0.502   Prob(JB):                      0.00577
Kurtosis:                       3.026   Cond. No.                         12.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

You can see that the results are fairly similar to dummy coding, as the grand mean is close to the e4/e4 mean.
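To make the intercept's meaning concrete, here is a minimal sketch on hypothetical toy data (not the Alzheimer's dataset): with Sum coding, the intercept equals the unweighted average of the group means, which differs from the overall sample mean whenever group sizes are unequal.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with deliberately unequal group sizes
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "group": ["a"] * 3 + ["b"] * 10 + ["c"] * 5,
    "y": np.r_[rng.normal(1.0, 0.1, 3),
               rng.normal(2.0, 0.1, 10),
               rng.normal(3.0, 0.1, 5)],
})

res = smf.ols("y ~ C(group, Sum)", data=toy).fit()

# The intercept equals the unweighted mean of the three group means ...
group_means = toy.groupby("group")["y"].mean()
print(res.params["Intercept"], group_means.mean())

# ... which is not the overall sample mean here, since the groups differ in size
print(toy["y"].mean())
```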

The contrast matrix also looks very similar, with the only distinction being that the last row is coded as -1. In effects coding there is no explicit reference group as in dummy coding; the grand mean serves as the reference. The group coded with -1 is central to this coding scheme but does not act as a conventional reference category for comparisons.

# Get all genotype levels and save them as a list
levels = df['genotype'].cat.categories.tolist()

# Create the contrast matrix
contrast = Sum().code_without_intercept(levels)

print("Levels:", levels)
print("Contrast Matrix:\n", contrast.matrix)
Levels: ['e2/e2', 'e2/e3', 'e2/e4', 'e3/e3', 'e3/e4', 'e4/e4']
Contrast Matrix:
 [[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]
 [-1. -1. -1. -1. -1.]]

In the matrix, each row corresponds to a level of the categorical variable. The coding scheme uses -1 for all variables in the last row to enforce the constraint that the sum of the coefficients is zero. This is the key difference from dummy coding. The columns of the matrix correspond to the levels of the categorical variable, excluding the last one (because the last level is redundant due to the zero-sum constraint).
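As a quick numerical check (a NumPy sketch, assuming six levels as above), the same matrix can be rebuilt by hand and the zero-sum constraint verified directly:

```python
import numpy as np

# Sum (unweighted effects) contrast matrix for k levels: an identity block
# for the first k-1 levels, and a row of -1 for the last level
k = 6  # number of genotype levels in this example
contrast = np.vstack([np.eye(k - 1), -np.ones(k - 1)])

print(contrast)
# Every column sums to zero -- exactly the zero-sum constraint
print(contrast.sum(axis=0))
```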

Weighted Effects Coding#

While unweighted effects coding uses the grand mean (unweighted average of all groups) as the reference, weighted effects coding modifies this approach to use the weighted mean of the dependent variable as the reference. The weighted mean accounts for the group sizes, giving more weight to groups with larger sample sizes.

This approach is particularly useful when group sizes differ substantially, as it makes the comparison more representative of the overall data distribution. The intercept in weighted effects coding represents the weighted mean of the dependent variable (WMf), while the coefficient for each group represents the deviation of that group's mean from the weighted mean. As this contrast is not offered directly, we will manually create both the contrast matrix and the design matrix by performing the following steps:

  1. Compute the sample counts for each category of the categorical variable

# Calculate the counts of each unique genotype level in the column 'genotype'
genotype_counts = df['genotype'].value_counts(sort=False)

# Extract the numerical counts (frequencies) associated with each genotype level
counts = genotype_counts.values

print("Genotype Levels:", levels)
print("Counts:", counts)
Genotype Levels: ['e2/e2', 'e2/e3', 'e2/e4', 'e3/e3', 'e3/e4', 'e4/e4']
Counts: [  2  36   9 143  45  10]
  2. Use these counts to create custom weights for the reference category

contrast_matrix = {
    "e2/e2": np.array([1, 0, 0, 0, 0]),
    "e2/e3": np.array([0, 1, 0, 0, 0]),
    "e2/e4": np.array([0, 0, 1, 0, 0]),
    "e3/e3": np.array([0, 0, 0, 1, 0]),
    "e3/e4": np.array([0, 0, 0, 0, 1]),
    "e4/e4": -counts[:-1] / counts[-1]
}

# Print each genotype's corresponding contrast vector
for key, value in contrast_matrix.items():
    print(f"{key}: {value}")
e2/e2: [1 0 0 0 0]
e2/e3: [0 1 0 0 0]
e2/e4: [0 0 1 0 0]
e3/e3: [0 0 0 1 0]
e3/e4: [0 0 0 0 1]
e4/e4: [ -0.2  -3.6  -0.9 -14.3  -4.5]
  3. Create the weighted effects coding design matrix and outcome vector

import statsmodels.api as sm

# Build the design matrix (X)
X = np.array([contrast_matrix[genotype] for genotype in df['genotype']])

# Add intercept
X = sm.add_constant(X)  

print(X)
print("Design matrix shape:", X.shape)
print("Column sums:", np.round(np.sum(X, axis=0), 2))
print("e3/e3 column:\n", X[:,4])

# Define the target vector (outcome variable)
y = df['WMf']
[[  1.   -0.2  -3.6  -0.9 -14.3  -4.5]
 [  1.    0.    0.    0.    1.    0. ]
 [  1.    0.    1.    0.    0.    0. ]
 ...
 [  1.    0.    0.    0.    0.    1. ]
 [  1.    0.    0.    0.    1.    0. ]
 [  1.    0.    0.    0.    1.    0. ]]
Design matrix shape: (245, 6)
Column sums: [245.  -0.  -0.  -0.   0.   0.]
e3/e3 column:
 [-14.3   1.    0.    1.    1.    1.    1.    1.    0.    1.    0.    0.
   0.  -14.3   1.    0.    0.    1.    0.    0.    0.    0.    1.  -14.3
   1.    1.    1.    1.    0.  -14.3   1.    1.    1.  -14.3   1.  -14.3
   1.    1.    0.    1.    0.    0.    0.    1.    1.    1.    1.    1.
   1.    0.    1.    1.    1.  -14.3   0.    0.    1.    0.    1.    0.
   1.    1.    0.    1.    1.    1.    0.    0.    0.    1.    1.    1.
   0.    1.    1.    1.    1.    0.    0.    0.    0.    1.    1.    1.
   0.    0.    1.    0.    1.    1.    1.    1.    1.    1.    1.    0.
   0.    0.    1.    1.    0.    1.    0.    0.    1.    0.    1.    1.
   1.    0.    0.    1.    0.    1.    0.    1.    1.    1.    0.    1.
   0.    0.    1.    0.    1.    1.    1.    1.    0.    1.    0.    1.
   1.    0.    0.    1.    1.    1.    0.    1.    1.    0.    0.    1.
   1.    0.    1.    0.    1.    1.    1.    1.    1.    0.    1.    1.
   0.    1.    1.    1.    1.    1.    0.    1.    1.    0.    1.    1.
   0.    1.    1.    0.    1.    0.    0.    1.    0.    1.    1.    1.
 -14.3   0.    1.    0.    0.    1.    1.    1.    0.    0.    1.    1.
   1.    0.    1.    1.    0.    1.    1.    1.    0.    0.    0.    0.
   1.  -14.3   1.    1.    0.    1.    0.    0.    0.    0.    1.    1.
   1.    0.    1.    0.    0.    1.    1.    1.    0.    1.    1.    0.
   1.    1.    1.    0.    1.    1.    1.    1.  -14.3   0.    1.    0.
   1.    0.    0.    1.    1. ]

We added some print statements to inspect the design matrix and check that everything is correct:

  • Shape: (245, 6)

    • 245 rows: Matches the number of observations in your dataset (df)

    • 6 columns: Matches the expected design matrix structure:

      • Intercept (constant column)

      • 5 contrast-coded columns for the categorical variable genotype

  • Column sums: [245. -0. -0. -0. 0. 0.]

    • The sum of the first column is 245, which corresponds to the number of observations (all intercept values are 1)

    • The sums of the other columns are 0, satisfying the sum-to-zero constraint of weighted effects coding

  • Inspecting a single column

    • The fifth column (e3/e3) contains a mix of 1, 0, and -14.3 values

      • 1 indicates the observation belongs to this genotype

      • 0 indicates the observation does not belong to this genotype

      • -14.3 indicates the observation belongs to the last group (e4/e4); it encodes that group's contribution to the weighted mean, accounting for the imbalance in group sizes to maintain the sum-to-zero constraint

  4. Create and fit the model. Note that we now use OLS() from statsmodels.api instead of ols() from statsmodels.formula.api, as we do not provide a formula but define the regression model directly through the design matrix:

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    WMf   R-squared:                       0.052
Model:                            OLS   Adj. R-squared:                  0.032
Method:                 Least Squares   F-statistic:                     2.605
Date:                Wed, 09 Apr 2025   Prob (F-statistic):             0.0257
Time:                        08:53:01   Log-Likelihood:                 219.06
No. Observations:                 245   AIC:                            -426.1
Df Residuals:                     239   BIC:                            -405.1
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.8216      0.006    128.363      0.000       0.809       0.834
x1            -0.0365      0.071     -0.518      0.605      -0.176       0.102
x2            -0.0203      0.015     -1.313      0.190      -0.051       0.010
x3             0.0579      0.033      1.765      0.079      -0.007       0.122
x4             0.0134      0.005      2.483      0.014       0.003       0.024
x5            -0.0357      0.013     -2.645      0.009      -0.062      -0.009
==============================================================================
Omnibus:                        9.927   Durbin-Watson:                   1.820
Prob(Omnibus):                  0.007   Jarque-Bera (JB):               10.312
Skew:                          -0.502   Prob(JB):                      0.00577
Kurtosis:                       3.026   Cond. No.                         35.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpreting the Output#

  1. Intercept:

    • In weighted effects coding, the intercept represents the weighted mean of WMf, not the grand mean.

    • The weighted mean is calculated by giving each group a weight proportional to its size:

    \[\text{Weighted Mean} = \frac{\sum (\text{Group Size} \times \text{Group Mean})}{\sum (\text{Group Size})}\]
  2. Coefficients:

    • Each coefficient represents the deviation of the group mean from the weighted mean, taking group sizes into account.

    • Larger groups have more influence on the weighted mean, so the coefficients for smaller groups can differ noticeably from their unweighted counterparts.

Summary#

  • Unweighted effects coding compares all groups to the grand mean, the unweighted average of the group means of the dependent variable. The intercept represents the grand mean, serving as the baseline for interpretation.

  • Weighted effects coding compares all groups to the weighted mean, which accounts for group sizes. The negative values in the design matrix reflect proportional adjustments needed to satisfy the sum-to-zero constraint.
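A compact way to see this contrast in action is to fit dummy and unweighted effects models side by side on hypothetical toy data (a sketch, not the chapter's dataset) and compare the intercepts with the candidate baselines:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with unequal group sizes
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "g": ["a"] * 4 + ["b"] * 12,
    "y": np.r_[rng.normal(1.0, 0.1, 4), rng.normal(2.0, 0.1, 12)],
})

dummy = smf.ols("y ~ C(g, Treatment)", data=toy).fit()
effect = smf.ols("y ~ C(g, Sum)", data=toy).fit()

means = toy.groupby("g")["y"].mean()
print(dummy.params["Intercept"], means["a"])     # reference-group mean
print(effect.params["Intercept"], means.mean())  # unweighted grand mean
print(toy["y"].mean())  # weighted mean, the baseline of weighted effects coding
```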

Categorical Regression - Method Summary#

| Coding | Code for RC | Intercept \(b_0\) | Slope \(b_j\) | Use if |
|---|---|---|---|---|
| Dummy | 0 | mean of the reference category (RC) | difference between each category's mean and the RC mean | one category should be compared to all others |
| Unweighted Effect | -1 | unweighted mean across all categories | difference between each category's mean and the unweighted mean | comparing categories assuming equal group sizes |
| Weighted Effect | \(-\frac{n_j}{n_{RC}}\) | weighted mean across all categories | difference between each category's mean and the weighted mean | comparing categories with unequal group sizes, accounting for the group sizes |