11.1 CFA#

Today, we will work with the HolzingerSwineford1939 dataset, which contains mental ability test scores from seventh- and eighth-grade pupils in two schools. We focus on nine observed variables:

  • x1, x2, x3: visual ability

  • x4, x5, x6: text processing skills

  • x7, x8, x9: speed ability

As in the Path Modelling session, we will use the semopy package. However, we will now use a new operator, =~, which allows us to define latent variables directly via their observed indicators, making CFA and SEM models easy to specify and read.

We first load and inspect the data:

import semopy

data = semopy.examples.holzinger39.get_data()
data
id sex ageyr agemo school grade x1 x2 x3 x4 x5 x6 x7 x8 x9
1 1 1 13 1 Pasteur 7.0 3.333333 7.75 0.375 2.333333 5.75 1.285714 3.391304 5.75 6.361111
2 2 2 13 7 Pasteur 7.0 5.333333 5.25 2.125 1.666667 3.00 1.285714 3.782609 6.25 7.916667
3 3 2 13 1 Pasteur 7.0 4.500000 5.25 1.875 1.000000 1.75 0.428571 3.260870 3.90 4.416667
4 4 1 13 2 Pasteur 7.0 5.333333 7.75 3.000 2.666667 4.50 2.428571 3.000000 5.30 4.861111
5 5 2 12 2 Pasteur 7.0 4.833333 4.75 0.875 2.666667 4.00 2.571429 3.695652 6.30 5.916667
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
297 346 1 13 5 Grant-White 8.0 4.000000 7.00 1.375 2.666667 4.25 1.000000 5.086957 5.60 5.250000
298 347 2 14 10 Grant-White 8.0 3.000000 6.00 1.625 2.333333 4.00 1.000000 4.608696 6.05 6.083333
299 348 2 14 3 Grant-White 8.0 4.666667 5.50 1.875 3.666667 5.75 4.285714 4.000000 6.00 7.611111
300 349 1 14 2 Grant-White 8.0 4.333333 6.75 0.500 3.666667 4.50 2.000000 5.086957 6.20 4.388889
301 351 1 13 5 Grant-White NaN 4.333333 6.00 3.375 3.666667 5.75 3.142857 4.086957 6.95 5.166667

301 rows × 15 columns

We can then simply fit a CFA model by using the standard string-based syntax to define three latent variables (visual, text and speed):

desc = '''visual =~ x1 + x2 + x3
          text =~ x4 + x5 + x6
          speed =~ x7 + x8 + x9'''

model = semopy.Model(desc)
results = model.fit(data)
print(results)
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.283
Number of iterations: 28
Params: 0.554 0.731 1.113 0.926 1.180 1.083 0.383 0.174 0.262 0.980 0.808 0.408 0.550 1.133 0.844 0.371 0.446 0.356 0.800 0.488 0.566

Model estimates#

We can inspect the estimated model parameters using:

estimates = model.inspect(std_est=True)
print(estimates)
      lval  op    rval  Estimate  Est. Std  Std. Err    z-value   p-value
0       x1   ~  visual  1.000000  0.771372         -          -         -
1       x2   ~  visual  0.554421  0.424031  0.099727   5.559413       0.0
2       x3   ~  visual  0.730526  0.581638   0.10918   6.691009       0.0
3       x4   ~    text  1.000000  0.851665         -          -         -
4       x5   ~    text  1.113076  0.855141  0.065392  17.021522       0.0
5       x6   ~    text  0.926120  0.838083  0.055425  16.709493       0.0
6       x7   ~   speed  1.000000  0.569252         -          -         -
7       x8   ~   speed  1.179980  0.722802  0.165045   7.149459       0.0
8       x9   ~   speed  1.082517  0.665275  0.151354   7.152197       0.0
9    speed  ~~   speed  0.383377  1.000000  0.086171   4.449045  0.000009
10   speed  ~~    text  0.173603  0.283220  0.049316   3.520223  0.000431
11   speed  ~~  visual  0.262135  0.470895  0.056252   4.659977  0.000003
12    text  ~~    text  0.980034  1.000000  0.112145   8.739002       0.0
13  visual  ~~  visual  0.808310  1.000000  0.145287   5.563548       0.0
14  visual  ~~    text  0.408277  0.458718  0.073527    5.55273       0.0
15      x1  ~~      x1  0.550161  0.404985  0.113439    4.84983  0.000001
16      x2  ~~      x2  1.133391  0.820197  0.101711  11.143202       0.0
17      x3  ~~      x3  0.843731  0.661698  0.090625    9.31016       0.0
18      x4  ~~      x4  0.371117  0.274667  0.047712   7.778264       0.0
19      x5  ~~      x5  0.446208  0.268734  0.058387   7.642264       0.0
20      x6  ~~      x6  0.356171  0.297617   0.04303   8.277334       0.0
21      x7  ~~      x7  0.799708  0.675952  0.081387   9.825966       0.0
22      x8  ~~      x8  0.487934  0.477557  0.074167   6.578856       0.0
23      x9  ~~      x9  0.565804  0.557409  0.070757   7.996483       0.0

Loadings#

The first nine rows correspond to the factor loadings, indicating how strongly each observed variable relates to its latent factor.

  • One loading per latent variable is fixed to 1 in the unstandardised solution in order to identify (scale) the factor.

  • The Estimate column contains the unstandardised loading.

  • The Est. Std column contains the standardised loading.

  • Std. Err represents the standard error (uncertainty) of the unstandardised estimate.

  • z-value is the unstandardised estimate divided by its standard error.

  • p-value tests the null hypothesis that the parameter equals zero in the population.

  • Large absolute z-values and small p-values indicate that a loading is reliably different from zero.

Standardised loadings (Est. Std) can be interpreted similarly to regression coefficients or correlations. For example, a standardised loading of 0.77 indicates that an increase of one standard deviation in the latent factor is associated with an increase of 0.77 standard deviations in the observed variable. Standardised loadings thus are particularly useful for comparing the relative strength of indicators within and across factors.
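
For example, we can pull just the loading rows out of the estimates DataFrame returned by inspect() (a small sketch, assuming the column names shown in the output above):

# Select the loading rows: operator '~' with a latent variable on the right-hand side
latents = ['visual', 'text', 'speed']
loadings = estimates[(estimates['op'] == '~') & (estimates['rval'].isin(latents))]
print(loadings[['lval', 'rval', 'Estimate', 'Est. Std']])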

Variances and Covariances of Latent Variables#

Rows such as visual ~~ visual, text ~~ text, and speed ~~ speed represent the variances of the latent variables. Because all latent variables in this CFA model are exogenous (they are not predicted by other variables), these parameters are interpreted as variances.

  • In the unstandardised solution, these values depend on the scaling of the latent variables.

  • In the standardised solution, all latent variances are equal to 1 by definition.

Rows such as visual ~~ text, visual ~~ speed, and text ~~ speed represent covariances between latent variables, reflecting how strongly the constructs are associated with each other. The positive and statistically significant covariances indicate that the latent abilities are positively related under the specified model.

  • In the unstandardised solution, these are raw covariances.

  • In the standardised solution, these values correspond to latent correlations.
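
As a quick numerical check, the standardised covariance is simply the unstandardised covariance divided by the product of the latent standard deviations; a small sketch using the visual and text values printed above:

import numpy as np

# visual ~~ text: covariance rescaled by the latent standard deviations
cov_vt = 0.408277                   # visual ~~ text (unstandardised)
var_v, var_t = 0.808310, 0.980034   # visual ~~ visual, text ~~ text
print(cov_vt / np.sqrt(var_v * var_t))  # ≈ 0.459, the latent correlation (Est. Std)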

Residual Variances of Observed Variables#

The final nine rows correspond to the residual variances of the observed variables. The standardised column shows the standardised residual variance, that is, the proportion of variance in the observed variable that is not explained by the latent factor.

For example, a standardised residual variance of 0.40 means that 40% of the variance in the observed variable remains unexplained, while 60% is explained by the latent factor.

In practice, residual variances are almost always greater than zero, because latent variables rarely account for all variability in observed measures. Their presence reflects measurement error and construct-irrelevant variance.
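
As a small check using x1 from the output above, the squared standardised loading gives the proportion of variance explained by the factor, and one minus that value reproduces the standardised residual variance:

# x1: standardised loading on visual is 0.771, standardised residual variance is 0.405
loading_x1 = 0.771372
explained = loading_x1 ** 2   # ≈ 0.595, share of variance explained by the factor
residual = 1 - explained      # ≈ 0.405, matches the x1 ~~ x1 Est. Std value
print(explained, residual)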

Learning break

  1. How can you calculate the z-value yourself?

  2. When should you read a variable ~~ variable output as variance? When instead as residual variance?

Solution
  1. The z-value is computed as the unstandardised Estimate divided by its standard error (Std. Err).

  2. Whether variable ~~ variable is interpreted as variance or residual variance depends on whether the variable is predicted in the model:

    • If the variable is exogenous (no incoming regression paths), it is a variance.

    • If the variable is endogenous (it has predictors), it is a residual variance.
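
To verify the first answer in code, we can recompute the z-values directly from the estimates table (a small sketch, assuming the column names shown above; the fixed loadings have no standard error, so their '-' entries are coerced to NaN):

import pandas as pd

# z-value = unstandardised estimate divided by its standard error
est = pd.to_numeric(estimates['Estimate'], errors='coerce')
se = pd.to_numeric(estimates['Std. Err'], errors='coerce')
print((est / se).round(3))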

Fit measures#

To assess model fit, semopy provides us with a wide range of fit measures:

stats = semopy.calc_stats(model)
print(stats.T)
                      Value
DoF            2.400000e+01
DoF Baseline   3.600000e+01
chi2           8.530573e+01
chi2 p-value   8.501896e-09
chi2 Baseline  9.188516e+02
CFI            9.305594e-01
GFI            9.071605e-01
AGFI           8.607407e-01
NFI            9.071605e-01
TLI            8.958391e-01
RMSEA          9.227505e-02
AIC            4.143318e+01
BIC            1.192825e+02
LogLik         2.834077e-01

  • χ² test (chi2, chi2 p-value)
    The chi-square test evaluates the null hypothesis that the model-implied covariance matrix equals the observed covariance matrix. A significant p-value indicates that the model does not reproduce the observed covariances exactly. However, the χ² test is highly sensitive to sample size, and significant results are common even for reasonably fitting models.

  • CFI (Comparative Fit Index)
    CFI compares the specified model to a baseline model that assumes no relationships among variables. Values closer to 1 indicate better fit. A CFI of 0.93 suggests an acceptable to good fit.

  • TLI (Tucker–Lewis Index)
    TLI is similar to CFI but penalises model complexity. Values closer to 1 indicate better relative fit. A TLI of 0.90 is often considered acceptable, though values above 0.95 are sometimes recommended.

  • RMSEA (Root Mean Square Error of Approximation)
    RMSEA summarises the degree of model misfit per degree of freedom, adjusting for model complexity. Values below 0.08 are often interpreted as acceptable, and values below 0.05 as good. An RMSEA of 0.09 indicates a mediocre fit.

  • Log-likelihood (LogLik)
    The log-likelihood reflects how likely the observed data are under the specified model and is used to compute information criteria.

  • AIC and BIC
    The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) balance model fit and complexity. Lower values indicate a better trade-off between goodness of fit and parsimony. These measures are useful only for comparing models, not as absolute indicators of fit.

Visualizing the Model#

For visualization, we can plot our specified model using the following code.

semopy.semplot(model, plot_covs=True, std_ests=True, filename='data/cfa_plot.pdf')
[Path diagram of the fitted CFA model, showing standardised loadings and factor covariances]

Comparing Models#

In addition to evaluating our main model with fit measures, we can also compare it to an alternative model. In the initial model, the latent factors are assumed to covary. However, a model in which the latent factors are set to be independent might describe the data more parsimoniously. To specify such a model, we need to fix the covariances between the factors to zero.

desc2 = '''visual =~ x1 + x2 + x3
           text =~ x4 + x5 + x6
           speed =~ x7 + x8 + x9
           
           # Set covariance to zero
           speed ~~ 0 * visual
           speed ~~ 0 * text
           text ~~ 0 * visual'''

# Fit the model
model2 = semopy.Model(desc2)
results2 = model2.fit(data)

# Print results
estimates2 = model2.inspect(std_est=True)
print(estimates2)
      lval  op    rval  Estimate  Est. Std  Std. Err    z-value   p-value
0       x1   ~  visual  1.000000  0.621242         -          -         -
1       x2   ~  visual  0.777690  0.479052  0.140577   5.532121       0.0
2       x3   ~  visual  1.106134  0.709327  0.213713   5.175797       0.0
3       x4   ~    text  1.000000  0.846971         -          -         -
4       x5   ~    text  1.132942  0.865571  0.067026  16.903003       0.0
5       x6   ~    text  0.924249  0.831695  0.056397  16.388161       0.0
6       x7   ~   speed  1.000000  0.607451         -          -         -
7       x8   ~   speed  1.225004  0.800752  0.189741    6.45619       0.0
8       x9   ~   speed  0.854879  0.560621  0.121394   7.042191       0.0
9    speed  ~~  visual  0.000000  0.000000         -          -         -
10   speed  ~~    text  0.000000  0.000000         -          -         -
11   speed  ~~   speed  0.436378  1.000000  0.096596   4.517579  0.000006
12    text  ~~  visual  0.000000  0.000000         -          -         -
13    text  ~~    text  0.968662  1.000000   0.11212   8.639478       0.0
14  visual  ~~  visual  0.524296  1.000000  0.130307   4.023553  0.000057
15      x1  ~~      x1  0.834191  0.614059  0.118174   7.059036       0.0
16      x2  ~~      x2  1.064637  0.770509  0.104633  10.174933       0.0
17      x3  ~~      x3  0.633474  0.496855  0.129037   4.909248  0.000001
18      x4  ~~      x4  0.381652  0.282639  0.048902   7.804458       0.0
19      x5  ~~      x5  0.416184  0.250786  0.059129   7.038589       0.0
20      x6  ~~      x6  0.368783  0.308283  0.044073   8.367492       0.0
21      x7  ~~      x7  0.746228  0.631003  0.086244   8.652507       0.0
22      x8  ~~      x8  0.366430  0.358797  0.096487     3.7977  0.000146
23      x9  ~~      x9  0.695777  0.685704  0.072202   9.636564       0.0
# Print fit statistics
stats2 = semopy.calc_stats(model2)
print(stats2.T)
                    Value
DoF             27.000000
DoF Baseline    36.000000
chi2           153.527262
chi2 p-value     0.000000
chi2 Baseline  918.851637
CFI              0.856683
GFI              0.832914
AGFI             0.777219
NFI              0.832914
TLI              0.808911
RMSEA            0.124983
AIC             34.979885
BIC            101.707870
LogLik           0.510057
# Visualise the model
semopy.semplot(model2, std_ests=True, filename='data/cfa_plot2.pdf')
[Path diagram of the constrained CFA model with standardised estimates]

We can see that the covariances between the latent factors (e.g. speed ~~ visual) are now forced to be zero.

print("Model1:")
print(stats.T)  # Model 1 (correlated latent factors)
print("\nModel2:")
print(stats2.T) # Model 2 (independent latent factors)
Model1:
                      Value
DoF            2.400000e+01
DoF Baseline   3.600000e+01
chi2           8.530573e+01
chi2 p-value   8.501896e-09
chi2 Baseline  9.188516e+02
CFI            9.305594e-01
GFI            9.071605e-01
AGFI           8.607407e-01
NFI            9.071605e-01
TLI            8.958391e-01
RMSEA          9.227505e-02
AIC            4.143318e+01
BIC            1.192825e+02
LogLik         2.834077e-01

Model2:
                    Value
DoF             27.000000
DoF Baseline    36.000000
chi2           153.527262
chi2 p-value     0.000000
chi2 Baseline  918.851637
CFI              0.856683
GFI              0.832914
AGFI             0.777219
NFI              0.832914
TLI              0.808911
RMSEA            0.124983
AIC             34.979885
BIC            101.707870
LogLik           0.510057

By comparing AIC and BIC values across models, we can assess which model provides a more parsimonious description of the data. Here, the lower AIC and BIC values favour the simpler independence model, even though its χ², CFI, TLI, and RMSEA indicate a clearly worse absolute fit. However, statistical fit must always be interpreted in light of theoretical considerations: a model with better information criteria is not necessarily the most meaningful from a substantive perspective.
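
To make the comparison easier to read, we can also combine the two sets of fit measures into a single side-by-side table (a small sketch reusing the stats and stats2 DataFrames from above, assuming the single-row layout shown earlier):

import pandas as pd

# Side-by-side table of selected fit measures for both models
comparison = pd.concat(
    [stats.T.rename(columns={'Value': 'Model 1 (correlated)'}),
     stats2.T.rename(columns={'Value': 'Model 2 (independent)'})],
    axis=1,
)
print(comparison.loc[['chi2', 'DoF', 'CFI', 'TLI', 'RMSEA', 'AIC', 'BIC']])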