Confirmatory Factor Analysis & Structural Equation Modelling#

We first install/load the semopy package and the HolzingerSwineford1939 dataset:

# Uncomment the following line to run in Google Colab
# !pip install semopy
import semopy

# Import your data here
data = semopy.examples.holzinger39.get_data()
data
id sex ageyr agemo school grade x1 x2 x3 x4 x5 x6 x7 x8 x9
1 1 1 13 1 Pasteur 7.0 3.333333 7.75 0.375 2.333333 5.75 1.285714 3.391304 5.75 6.361111
2 2 2 13 7 Pasteur 7.0 5.333333 5.25 2.125 1.666667 3.00 1.285714 3.782609 6.25 7.916667
3 3 2 13 1 Pasteur 7.0 4.500000 5.25 1.875 1.000000 1.75 0.428571 3.260870 3.90 4.416667
4 4 1 13 2 Pasteur 7.0 5.333333 7.75 3.000 2.666667 4.50 2.428571 3.000000 5.30 4.861111
5 5 2 12 2 Pasteur 7.0 4.833333 4.75 0.875 2.666667 4.00 2.571429 3.695652 6.30 5.916667
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
297 346 1 13 5 Grant-White 8.0 4.000000 7.00 1.375 2.666667 4.25 1.000000 5.086957 5.60 5.250000
298 347 2 14 10 Grant-White 8.0 3.000000 6.00 1.625 2.333333 4.00 1.000000 4.608696 6.05 6.083333
299 348 2 14 3 Grant-White 8.0 4.666667 5.50 1.875 3.666667 5.75 4.285714 4.000000 6.00 7.611111
300 349 1 14 2 Grant-White 8.0 4.333333 6.75 0.500 3.666667 4.50 2.000000 5.086957 6.20 4.388889
301 351 1 13 5 Grant-White NaN 4.333333 6.00 3.375 3.666667 5.75 3.142857 4.086957 6.95 5.166667

301 rows × 15 columns

Exercise 1: Fit a CFA model#

Fit a CFA model with 3 latent variables. x2 and x3 should load onto visual, x5 and x6 shoud load onto text, x8 and x9 should load onto speed. Assume all latent factors to be uncorrelated with each other. After Specifying the model, fit it and inspect the model estimates as well as the model fit measures.

# Specify the model
desc = '''
    visual =~ x2 + x3
    text =~ x5 + x6
    speed =~ x8 + x9
    
    # Set correlations to zero
    speed ~~ 0 * visual
    speed ~~ 0 * text
    text ~~ 0 * visual
'''

# Fit the model
model = semopy.Model(desc)
results = model.fit(data)
print(results)

# Get the estimates
estimates = model.inspect(std_est=True)
print(estimates)

# Get the fit measures
stats = semopy.calc_stats(model)
print(stats.T)

# Visualize the model and factor correlations
semopy.semplot(model, 'figures/holzinger39.png', plot_covs=True, std_ests=True)
WARNING:root:Fisher Information Matrix is not PD.Moore-Penrose inverse will be used instead of Cholesky decomposition. See 10.1109/TSP.2012.2208105.
WARNING:root:Fisher Information Matrix is not PD.Moore-Penrose inverse will be used instead of Cholesky decomposition. See 10.1109/TSP.2012.2208105.
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.243
Number of iterations: 16
Params: 0.646 0.753 0.644 0.710 1.347 0.699 0.683 0.984 0.313 0.432 0.312 0.721
      lval  op    rval  Estimate  Est. Std  Std. Err    z-value p-value
0       x2   ~  visual  1.000000  0.711243         -          -       -
1       x3   ~  visual  0.645532  0.477991  0.069086   9.343818     0.0
2       x5   ~    text  1.000000  0.900790         -          -       -
3       x6   ~    text  0.753297  0.799264  0.039054  19.288631     0.0
4       x8   ~   speed  1.000000  0.833566         -          -       -
5       x9   ~   speed  0.644022  0.538675  0.052423  12.285023     0.0
6    speed  ~~  visual  0.000000  0.000000         -          -       -
7    speed  ~~    text  0.000000  0.000000         -          -       -
8    speed  ~~   speed  0.710144  1.000000  0.055045  12.901184     0.0
9     text  ~~  visual  0.000000  0.000000         -          -       -
10    text  ~~    text  1.346736  1.000000  0.081896  16.444561     0.0
11  visual  ~~  visual  0.698939  1.000000  0.070463   9.919235     0.0
12      x2  ~~      x2  0.682726  0.494133  0.067459  10.120596     0.0
13      x3  ~~      x3  0.983522  0.771524   0.09242  10.641903     0.0
14      x5  ~~      x5  0.312987  0.188578  0.057231   5.468815     0.0
15      x6  ~~      x6  0.432071  0.361178  0.050901   8.488501     0.0
16      x8  ~~      x8  0.311893  0.305168  0.044302   7.040144     0.0
17      x9  ~~      x9  0.720522  0.709829  0.064109  11.238952     0.0
                      Value
DoF            9.000000e+00
DoF Baseline   1.500000e+01
chi2           7.308156e+01
chi2 p-value   3.776091e-12
chi2 Baseline  3.976807e+02
CFI            8.325456e-01
GFI            8.162306e-01
AGFI           6.937176e-01
NFI            8.162306e-01
TLI            7.209094e-01
RMSEA          1.540581e-01
AIC            2.351441e+01
BIC            6.799973e+01
LogLik         2.427959e-01
../../_images/eef4a3edf0a5ab3a3d3e8fa3c5917cbcfc7b853ba94dd38355554fc50008ff2d.svg

Exercise 2: Fit a SEM#

Adapt your model from above to include a structural part, meaning a unidirectional association on the level of latent variables. Print the model estimates and the model fit statistics. Does the CFA or the SEM model provide better fit? Provide an explanation for your conclusion.

# Specify the model
desc2 = '''
    # Measurement model
    visual =~ x2 + x3
    text =~ x5 + x6
    speed =~ x8 + x9
    
    # Structural model
    visual ~ text
    speed ~ text
    visual~ speed
'''

# Fit the model
model2 = semopy.Model(desc2)
results2 = model2.fit(data)
print(results2)

# Get the  estimates
estimates2 = model2.inspect(std_est=True)
print(estimates2)

# Get the fit measures
stats2 = semopy.calc_stats(model2)
print(stats2.T)

# Visualize the model
semopy.semplot(model2, 'figures/holzinger39_2.png', plot_covs=True, std_ests=True)
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.037
Number of iterations: 30
Params: 1.356 1.114 1.742 0.117 0.526 0.139 0.245 0.911 0.232 1.049 0.663 0.749 0.066 0.759 0.219
      lval  op    rval  Estimate  Est. Std  Std. Err   z-value   p-value
0   visual   ~    text  0.116602  0.192878  0.053476  2.180459  0.029223
1   visual   ~   speed  0.526146  0.466943   0.14389  3.656581  0.000256
2    speed   ~    text  0.139384  0.259796  0.047092  2.959813  0.003078
3       x2   ~  visual  1.000000  0.490794         -         -         -
4       x3   ~  visual  1.355716  0.692933  0.317519  4.269713   0.00002
5       x5   ~    text  1.000000  0.740829         -         -         -
6       x6   ~    text  1.113639  0.971873  0.211906  5.255348       0.0
7       x8   ~   speed  1.000000  0.506684         -         -         -
8       x9   ~   speed  1.742380  0.885699  0.425147  4.098305  0.000042
9    speed  ~~   speed  0.244558  0.932506  0.074144  3.298424  0.000972
10    text  ~~    text  0.911104  1.000000  0.205777  4.427619   0.00001
11  visual  ~~  visual  0.232407  0.697966  0.076277  3.046885  0.002312
12      x2  ~~      x2  1.049368  0.759121  0.114437  9.169868       0.0
13      x3  ~~      x3  0.662589  0.519845  0.149797  4.423247   0.00001
14      x5  ~~      x5  0.748985  0.451172  0.177447  4.220883  0.000024
15      x6  ~~      x6  0.066350  0.055463  0.206704  0.320992  0.748216
16      x8  ~~      x8  0.759281  0.743271  0.085924  8.836609       0.0
17      x9  ~~      x9  0.218759  0.215538  0.181822  1.203154  0.228917
                    Value
DoF              6.000000
DoF Baseline    15.000000
chi2            11.148178
chi2 p-value     0.083903
chi2 Baseline  397.680707
CFI              0.986547
GFI              0.971967
AGFI             0.929918
NFI              0.971967
TLI              0.966368
RMSEA            0.053480
AIC             29.925926
BIC             85.532580
LogLik           0.037037
../../_images/81adeaeddfe2a4d327a57b7d230a42da2cc6d42032a310d163a2c8fed6f04b38.svg

Voluntary exercise 1: Higher level factors#

Go back to your CFA model and add a higher level factor onto which all latent variables load onto. Name it intelligence. Does the higher level factor improve model fit?

# Specify the model
desc3 = '''
    visual =~ x2 + x3
    text =~ x5 + x6
    speed =~ x8 + x9
    
    # Higher level factor
    intelligence =~ text + speed + visual
'''

# Fit the model
model3 = semopy.Model(desc3)
results3 = model3.fit(data)
print(results3)

# Print the model estimates
estimates3 = model3.inspect(std_est=True)
print(estimates3)

# Print the model fit measures
stats3 = semopy.calc_stats(model3)
print(stats3.T)

# Visualize the model
semopy.semplot(model3, 'figures/holzinger39_3.png', plot_covs=True, std_ests=True)
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.037
Number of iterations: 39
Params: 1.363 1.116 1.721 0.897 1.203 0.143 0.151 0.766 0.124 1.051 0.659 0.751 0.064 0.757 0.227
            lval  op          rval  Estimate  Est. Std  Std. Err   z-value  \
0           text   ~  intelligence  1.000000  0.396214         -         -   
1          speed   ~  intelligence  0.897471  0.657223  0.351128  2.555968   
2         visual   ~  intelligence  1.203470  0.790080  0.444534  2.707259   
3             x2   ~        visual  1.000000  0.489410         -         -   
4             x3   ~        visual  1.363364  0.694774   0.32002  4.260246   
5             x5   ~          text  1.000000  0.740019         -         -   
6             x6   ~          text  1.116107  0.972835  0.213113  5.237161   
7             x8   ~         speed  1.000000  0.510062         -         -   
8             x9   ~         speed  1.720615  0.880884  0.415237  4.143696   
9   intelligence  ~~  intelligence  0.142677  1.000000  0.077422  1.842853   
10         speed  ~~         speed  0.151134  0.568058  0.050385  2.999596   
11          text  ~~          text  0.766177  0.843014  0.161477  4.744796   
12        visual  ~~        visual  0.124396  0.375773  0.076715  1.621541   
13            x2  ~~            x2  1.051047  0.760478  0.114328  9.193273   
14            x3  ~~            x3  0.659403  0.517289  0.150627   4.37772   
15            x5  ~~            x5  0.750763  0.452371  0.177671  4.225568   
16            x6  ~~            x6  0.064111  0.053593  0.207847  0.308453   
17            x8  ~~            x8  0.756587  0.739837  0.085921    8.8056   
18            x9  ~~            x9  0.227420  0.224043  0.178078  1.277079   

     p-value  
0          -  
1   0.010589  
2   0.006784  
3          -  
4    0.00002  
5          -  
6        0.0  
7          -  
8   0.000034  
9   0.065351  
10  0.002703  
11  0.000002  
12  0.104902  
13       0.0  
14  0.000012  
15  0.000024  
16  0.757738  
17       0.0  
18  0.201574  
                    Value
DoF              6.000000
DoF Baseline    15.000000
chi2            11.147616
chi2 p-value     0.083920
chi2 Baseline  397.680707
CFI              0.986549
GFI              0.971968
AGFI             0.929921
NFI              0.971968
TLI              0.966371
RMSEA            0.053477
AIC             29.925929
BIC             85.532583
LogLik           0.037035
../../_images/2b59433a3392dac019217d88fe396c292c9e5b693ea41699a958135d8e9aa116.svg

Voluntary exercise 2: Advanced models I#

Now go back to your SEM model and modify it in a way such that the factor variance of the speed factor is fixed to 1. How does that affect the interpretation of the loading associated with that factor?

##solution
# Specify the model
desc4 = '''
    # Measurement model
    visual =~ x2 + x3
    text =~ x5 + x6
    speed =~ x8 + x9

    # Structrual model
    visual ~ text
    speed ~ text
    visual ~ speed

    # Fix factor variance to 1
    speed ~~ 1*speed
'''

# Fit the model
model4 = semopy.Model(desc4)
results4 = model4.fit(data)

# Print the model estimates
estimates4 = model4.inspect(std_est=True)
print(estimates4)

# Print the model fit measures
stats4 = semopy.calc_stats(model4)
print(stats4.T)

# Visualize the model
semopy.semplot(model4, 'figures/holzinger39_4.png', plot_covs=True, std_ests=True)
      lval  op    rval      Estimate      Est. Std  Std. Err    z-value  \
0   visual   ~    text  1.837451e-01  2.829518e-01  0.063006   2.916327   
1   visual   ~   speed  1.448443e-01  2.434435e-01  0.057291   2.528225   
2    speed   ~    text  1.714494e-01  1.570853e-01   0.06553   2.616357   
3       x2   ~  visual  1.000000e+00  5.122049e-01         -          -   
4       x3   ~  visual  1.248551e+00  6.655264e-01  0.367263   3.399608   
5       x5   ~    text  1.000000e+00  7.200297e-01         -          -   
6       x6   ~    text  1.179038e+00  1.000000e+00  0.254336   4.635755   
7       x8   ~   speed  1.000000e+00  9.581981e-01         -          -   
8       x9   ~   speed  4.938572e-01  4.916300e-01  0.065675   7.519672   
9    speed  ~~   speed  1.000000e+00  9.753242e-01         -          -   
10    text  ~~    text  8.606972e-01  1.000000e+00  0.215062   4.002082   
11  visual  ~~  visual  3.045342e-01  8.390326e-01  0.108182   2.815018   
12      x2  ~~      x2  1.020511e+00  7.376462e-01  0.134482   7.588483   
13      x3  ~~      x3  7.116272e-01  5.570746e-01  0.174637   4.074899   
14      x5  ~~      x5  7.994613e-01  4.815572e-01  0.190872   4.188463   
15      x6  ~~      x6  5.362485e-18  4.481881e-18  0.249393        0.0   
16      x8  ~~      x8  9.140986e-02  8.185640e-02  0.087437   1.045431   
17      x9  ~~      x9  7.845454e-01  7.583000e-01  0.069018  11.367247   

     p-value  
0   0.003542  
1   0.011464  
2   0.008887  
3          -  
4   0.000675  
5          -  
6   0.000004  
7          -  
8        0.0  
9          -  
10  0.000063  
11  0.004877  
12       0.0  
13  0.000046  
14  0.000028  
15       1.0  
16  0.295824  
17       0.0  
                      Value
DoF            7.000000e+00
DoF Baseline   1.500000e+01
chi2           4.121157e+01
chi2 p-value   7.372890e-07
chi2 Baseline  3.976807e+02
CFI            9.106002e-01
GFI            8.963702e-01
AGFI           7.779362e-01
NFI            8.963702e-01
TLI            8.084291e-01
RMSEA          1.276371e-01
AIC            2.772617e+01
BIC            7.962571e+01
LogLik         1.369155e-01
../../_images/9cb9d410a13df4da8f7722b2e71b369be8a4bc6cf331160273e22cd6a8a066a2.svg

Voluntary exercise 3: Advanced models II#

Re-load the dataset again, this time without deleting any variables. Specify and evaluate a model that tests the following hypothesis:

  • x1,x2 and x3 should load onto visual, x4,x5 and x6 shoud load onto text, x7,x8 and x9 should load onto speed.

  • visual and text load onto a higher level factor called intelligence.

  • intelligence explains 100% of the covariance between visual and text.

  • intelligence predicts speed.

# Load the dataset
data2 = semopy.examples.holzinger39.get_data()

# Specify the model
desc5 = '''
    visual =~ x1 + x2 + x3
    text =~ x4 + x5 + x6
    speed =~ x7 + x8 + x9

    # Add a higher level factor
    intelligence =~ text + visual

    # No residual covariance between text and visual
    text ~~ 0*visual

    # Intelligence predicts speed
    speed ~ intelligence
'''

# Fit the model
model5 = semopy.Model(desc5)
results5 = model5.fit(data2)

# Get model estimates
estimates5 = model5.inspect()
print(estimates5)

# Get fit statistics
stats5 = semopy.calc_stats(model5)
print(stats5.T)

# Visualize the model
semopy.semplot(model5, 'figures/holzinger39_5.png', plot_covs=True, std_ests=True)
            lval  op          rval  Estimate  Std. Err    z-value   p-value
0           text   ~  intelligence  1.000000         -          -         -
1         visual   ~  intelligence  1.511090  0.394682   3.828623  0.000129
2          speed   ~  intelligence  0.641936  0.150723   4.259032  0.000021
3             x1   ~        visual  1.000000         -          -         -
4             x2   ~        visual  0.554007   0.09969   5.557311       0.0
5             x3   ~        visual  0.730062  0.109142    6.68908       0.0
6             x4   ~          text  1.000000         -          -         -
7             x5   ~          text  1.112913  0.065402  17.016368       0.0
8             x6   ~          text  0.926028  0.055437  16.704137       0.0
9             x7   ~         speed  1.000000         -          -         -
10            x8   ~         speed  1.181997  0.165403   7.146174       0.0
11            x9   ~         speed  1.083819  0.151597   7.149338       0.0
12          text  ~~        visual  0.000000         -          -         -
13          text  ~~          text  0.709723  0.107058   6.629328       0.0
14  intelligence  ~~  intelligence  0.270079  0.090969    2.96892  0.002988
15         speed  ~~         speed  0.271288   0.06871    3.94832  0.000079
16        visual  ~~        visual  0.192062   0.17018   1.128585  0.259073
17            x1  ~~            x1  0.549578  0.113495   4.842322  0.000001
18            x2  ~~            x2  1.133480  0.101706  11.144668       0.0
19            x3  ~~            x3  0.843889  0.090617    9.31273       0.0
20            x4  ~~            x4  0.371101  0.047719   7.776749       0.0
21            x5  ~~            x5  0.446237  0.058388   7.642573       0.0
22            x6  ~~            x6  0.356280  0.043038   8.278273       0.0
23            x7  ~~            x7  0.799993  0.081361   9.832645       0.0
24            x8  ~~            x8  0.487875   0.07424   6.571633       0.0
25            x9  ~~            x9  0.565861  0.070778   7.994932       0.0
                      Value
DoF            2.400000e+01
DoF Baseline   3.600000e+01
chi2           8.530582e+01
chi2 p-value   8.501597e-09
chi2 Baseline  9.188516e+02
CFI            9.305593e-01
GFI            9.071604e-01
AGFI           8.607406e-01
NFI            9.071604e-01
TLI            8.958390e-01
RMSEA          9.227512e-02
AIC            4.143318e+01
BIC            1.192825e+02
LogLik         2.834080e-01
../../_images/06bc282d71750eadd91fdaa1945a42f5898850770ed759ab38a58a8f4476ac09.svg