Confirmatory Factor Analysis & Structural Equation Modelling#

Exercise 1: Preparation and data exploration#

Load the semopypackage and the HolzingerSwineford1939 dataset. If you want to keep your dataset small and organized, you can use .drop() to remove the columns x1, x4and x7.

# Uncomment the following line to run in Google Colab
# !pip install semopy
import semopy

# Import your data here
data = semopy.examples.holzinger39.get_data()

# Drop the unnecessary columns/variables
data = data.drop(columns=['x1', 'x4', 'x7'])
print(data)

# Alternative way to drop columns:
# data = data.drop(['x1', 'x4', 'x7'], axis = "columns")
# print(data)
      id  sex  ageyr  agemo       school  grade    x2     x3    x5        x6  \
1      1    1     13      1      Pasteur    7.0  7.75  0.375  5.75  1.285714   
2      2    2     13      7      Pasteur    7.0  5.25  2.125  3.00  1.285714   
3      3    2     13      1      Pasteur    7.0  5.25  1.875  1.75  0.428571   
4      4    1     13      2      Pasteur    7.0  7.75  3.000  4.50  2.428571   
5      5    2     12      2      Pasteur    7.0  4.75  0.875  4.00  2.571429   
..   ...  ...    ...    ...          ...    ...   ...    ...   ...       ...   
297  346    1     13      5  Grant-White    8.0  7.00  1.375  4.25  1.000000   
298  347    2     14     10  Grant-White    8.0  6.00  1.625  4.00  1.000000   
299  348    2     14      3  Grant-White    8.0  5.50  1.875  5.75  4.285714   
300  349    1     14      2  Grant-White    8.0  6.75  0.500  4.50  2.000000   
301  351    1     13      5  Grant-White    NaN  6.00  3.375  5.75  3.142857   

       x8        x9  
1    5.75  6.361111  
2    6.25  7.916667  
3    3.90  4.416667  
4    5.30  4.861111  
5    6.30  5.916667  
..    ...       ...  
297  5.60  5.250000  
298  6.05  6.083333  
299  6.00  7.611111  
300  6.20  4.388889  
301  6.95  5.166667  

[301 rows x 12 columns]

Exercise 2: Fit a CFA model#

Fit a CFA model with 3 latent variables. x2 and x3 should load onto visual, x5 and x6 shoud load onto text, x8 and x9 should load onto speed. Assume all latent factors to be uncorrelated with each other. After Specifying the model, fit it and inspect the model estimates as well as the model fit measures.

# Specify the model
desc = '''
    visual =~ x2 + x3
    text =~ x5 + x6
    speed =~ x8 + x9
    
    # Set correlations to zero
    speed ~~ 0 * visual
    speed ~~ 0 * text
    text ~~ 0 * visual
'''

# Fit the model
model = semopy.Model(desc)
results = model.fit(data)
print(results)

# Get the estimates
estimates = model.inspect()
print(estimates)

# Get the fit measures
stats = semopy.calc_stats(model)
print(stats.T)

# Visualize the model and factor correlations
semopy.semplot(model, 'figures/holzinger39.png', plot_covs=True)
WARNING:root:Fisher Information Matrix is not PD.Moore-Penrose inverse will be used instead of Cholesky decomposition. See 10.1109/TSP.2012.2208105.
WARNING:root:Fisher Information Matrix is not PD.Moore-Penrose inverse will be used instead of Cholesky decomposition. See 10.1109/TSP.2012.2208105.
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.243
Number of iterations: 16
Params: 0.646 0.753 0.644 0.710 1.347 0.699 0.683 0.984 0.313 0.432 0.312 0.721
      lval  op    rval  Estimate  Std. Err    z-value p-value
0       x2   ~  visual  1.000000         -          -       -
1       x3   ~  visual  0.645532  0.069086   9.343818     0.0
2       x5   ~    text  1.000000         -          -       -
3       x6   ~    text  0.753297  0.039054  19.288631     0.0
4       x8   ~   speed  1.000000         -          -       -
5       x9   ~   speed  0.644022  0.052423  12.285023     0.0
6    speed  ~~  visual  0.000000         -          -       -
7    speed  ~~    text  0.000000         -          -       -
8    speed  ~~   speed  0.710144  0.055045  12.901184     0.0
9     text  ~~  visual  0.000000         -          -       -
10    text  ~~    text  1.346736  0.081896  16.444561     0.0
11  visual  ~~  visual  0.698939  0.070463   9.919235     0.0
12      x2  ~~      x2  0.682726  0.067459  10.120596     0.0
13      x3  ~~      x3  0.983522   0.09242  10.641903     0.0
14      x5  ~~      x5  0.312987  0.057231   5.468815     0.0
15      x6  ~~      x6  0.432071  0.050901   8.488501     0.0
16      x8  ~~      x8  0.311893  0.044302   7.040144     0.0
17      x9  ~~      x9  0.720522  0.064109  11.238952     0.0
                      Value
DoF            9.000000e+00
DoF Baseline   1.500000e+01
chi2           7.308156e+01
chi2 p-value   3.776091e-12
chi2 Baseline  3.976807e+02
CFI            8.325456e-01
GFI            8.162306e-01
AGFI           6.937176e-01
NFI            8.162306e-01
TLI            7.209094e-01
RMSEA          1.540581e-01
AIC            2.351441e+01
BIC            6.799973e+01
LogLik         2.427959e-01
../../_images/4c7c60edf7651bc57564426be753f46b09f3928cd30b00cf094fc374264ed061.svg

Exercise 3: Fit a SEM model#

Adapt your model from above to include a structural part, meaning a unidirectional association on the level of latent variables. Print the model estimates and the model fit statistics. Does the CFA or the SEM model provide better fit? Provide an explanation for your conclusion.

# Specify the model
desc2 = '''
    # Measurement model
    visual =~ x2 + x3
    text =~ x5 + x6
    speed =~ x8 + x9
    
    # Structural model
    visual ~ text
    speed ~ text
    visual~ speed
'''

# Fit the model
model2 = semopy.Model(desc2)
results2 = model2.fit(data)
print(results2)

# Get the  estimates
estimates2 = model2.inspect()
print(estimates2)

# Get the fit measures
stats2 = semopy.calc_stats(model2)
print(stats2.T)

# Visualize the model
semopy.semplot(model2, 'figures/holzinger39_2.png')
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.037
Number of iterations: 30
Params: 1.356 1.114 1.742 0.117 0.526 0.139 0.245 0.911 0.232 1.049 0.663 0.749 0.066 0.759 0.219
      lval  op    rval  Estimate  Std. Err   z-value   p-value
0   visual   ~    text  0.116602  0.053476  2.180459  0.029223
1   visual   ~   speed  0.526146   0.14389  3.656581  0.000256
2    speed   ~    text  0.139384  0.047092  2.959813  0.003078
3       x2   ~  visual  1.000000         -         -         -
4       x3   ~  visual  1.355716  0.317519  4.269713   0.00002
5       x5   ~    text  1.000000         -         -         -
6       x6   ~    text  1.113639  0.211906  5.255348       0.0
7       x8   ~   speed  1.000000         -         -         -
8       x9   ~   speed  1.742380  0.425147  4.098305  0.000042
9    speed  ~~   speed  0.244558  0.074144  3.298424  0.000972
10    text  ~~    text  0.911104  0.205777  4.427619   0.00001
11  visual  ~~  visual  0.232407  0.076277  3.046885  0.002312
12      x2  ~~      x2  1.049368  0.114437  9.169868       0.0
13      x3  ~~      x3  0.662589  0.149797  4.423247   0.00001
14      x5  ~~      x5  0.748985  0.177447  4.220883  0.000024
15      x6  ~~      x6  0.066350  0.206704  0.320992  0.748216
16      x8  ~~      x8  0.759281  0.085924  8.836609       0.0
17      x9  ~~      x9  0.218759  0.181822  1.203154  0.228917
                    Value
DoF              6.000000
DoF Baseline    15.000000
chi2            11.148178
chi2 p-value     0.083903
chi2 Baseline  397.680707
CFI              0.986547
GFI              0.971967
AGFI             0.929918
NFI              0.971967
TLI              0.966368
RMSEA            0.053480
AIC             29.925926
BIC             85.532580
LogLik           0.037037
../../_images/0425c8a4636b9091fe017cbec2c5a2231d0c4fc807c2f9d5adff09f72f7479ae.svg

Voluntary exercise 1: Higher level factors#

Go back to your CFA model and add a higher level factor onto which all latent variables load onto. Name it intelligence. Does the higher level factor improve model fit?

# Specify the model
desc3 = '''
    visual =~ x2 + x3
    text =~ x5 + x6
    speed =~ x8 + x9
    
    # Higher level factor
    intelligence =~ text + speed + visual
'''

# Fit the model
model3 = semopy.Model(desc3)
results3 = model3.fit(data)
print(results3)

# Print the model estimates
estimates3 = model3.inspect()
print(estimates3)

# Print the model fit measures
stats3 = semopy.calc_stats(model3)
print(stats3.T)

# Visualize the model
semopy.semplot(model3, 'figures/holzinger39_3.png')
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.037
Number of iterations: 39
Params: 1.363 1.116 1.721 0.897 1.203 0.143 0.151 0.766 0.124 1.051 0.659 0.751 0.064 0.757 0.227
            lval  op          rval  Estimate  Std. Err   z-value   p-value
0           text   ~  intelligence  1.000000         -         -         -
1          speed   ~  intelligence  0.897471  0.351128  2.555968  0.010589
2         visual   ~  intelligence  1.203470  0.444534  2.707259  0.006784
3             x2   ~        visual  1.000000         -         -         -
4             x3   ~        visual  1.363364   0.32002  4.260246   0.00002
5             x5   ~          text  1.000000         -         -         -
6             x6   ~          text  1.116107  0.213113  5.237161       0.0
7             x8   ~         speed  1.000000         -         -         -
8             x9   ~         speed  1.720615  0.415237  4.143696  0.000034
9   intelligence  ~~  intelligence  0.142677  0.077422  1.842853  0.065351
10         speed  ~~         speed  0.151134  0.050385  2.999596  0.002703
11          text  ~~          text  0.766177  0.161477  4.744796  0.000002
12        visual  ~~        visual  0.124396  0.076715  1.621541  0.104902
13            x2  ~~            x2  1.051047  0.114328  9.193273       0.0
14            x3  ~~            x3  0.659403  0.150627   4.37772  0.000012
15            x5  ~~            x5  0.750763  0.177671  4.225568  0.000024
16            x6  ~~            x6  0.064111  0.207847  0.308453  0.757738
17            x8  ~~            x8  0.756587  0.085921    8.8056       0.0
18            x9  ~~            x9  0.227420  0.178078  1.277079  0.201574
                    Value
DoF              6.000000
DoF Baseline    15.000000
chi2            11.147616
chi2 p-value     0.083920
chi2 Baseline  397.680707
CFI              0.986549
GFI              0.971968
AGFI             0.929921
NFI              0.971968
TLI              0.966371
RMSEA            0.053477
AIC             29.925929
BIC             85.532583
LogLik           0.037035
../../_images/e4fb9c6f9bc64a7cf1229030ea35ea96d5f4c4f20b319984b937f63185524c12.svg

Voluntary exercise 2: Advanced models I#

Now go back to your SEM model and modify it in a way such that the factor variance of the speed factor is fixed to 1. How does that affect the interpretation of the loading associated with that factor?

##solution
# Specify the model
desc4 = '''
    # Measurement model
    visual =~ x2 + x3
    text =~ x5 + x6
    speed =~ x8 + x9

    # Structrual model
    visual ~ text
    speed ~ text
    visual ~ speed

    # Fix factor variance to 1
    speed ~~ 1*speed
'''

# Fit the model
model4 = semopy.Model(desc4)
results4 = model4.fit(data)

# Print the model estimates
estimates4 = model4.inspect()
print(estimates4)

# Print the model fit measures
stats4 = semopy.calc_stats(model4)
print(stats4.T)

# Visualize the model
semopy.semplot(model4, 'figures/holzinger39_4.png')
      lval  op    rval  Estimate  Std. Err    z-value   p-value
0   visual   ~    text  0.183745  0.063006   2.916327  0.003542
1   visual   ~   speed  0.144844  0.057291   2.528225  0.011464
2    speed   ~    text  0.171449   0.06553   2.616357  0.008887
3       x2   ~  visual  1.000000         -          -         -
4       x3   ~  visual  1.248551  0.367263   3.399608  0.000675
5       x5   ~    text  1.000000         -          -         -
6       x6   ~    text  1.179038  0.254336   4.635755  0.000004
7       x8   ~   speed  1.000000         -          -         -
8       x9   ~   speed  0.493857  0.065675   7.519672       0.0
9    speed  ~~   speed  1.000000         -          -         -
10    text  ~~    text  0.860697  0.215062   4.002082  0.000063
11  visual  ~~  visual  0.304534  0.108182   2.815018  0.004877
12      x2  ~~      x2  1.020511  0.134482   7.588483       0.0
13      x3  ~~      x3  0.711627  0.174637   4.074899  0.000046
14      x5  ~~      x5  0.799461  0.190872   4.188463  0.000028
15      x6  ~~      x6  0.000000  0.249393        0.0       1.0
16      x8  ~~      x8  0.091410  0.087437   1.045431  0.295824
17      x9  ~~      x9  0.784545  0.069018  11.367247       0.0
                      Value
DoF            7.000000e+00
DoF Baseline   1.500000e+01
chi2           4.121157e+01
chi2 p-value   7.372890e-07
chi2 Baseline  3.976807e+02
CFI            9.106002e-01
GFI            8.963702e-01
AGFI           7.779362e-01
NFI            8.963702e-01
TLI            8.084291e-01
RMSEA          1.276371e-01
AIC            2.772617e+01
BIC            7.962571e+01
LogLik         1.369155e-01
../../_images/eba7766e9476baf5199e31159cf08257e7adb31cde15df73b323b156b2171078.svg

Voluntary exercise 3: Advanced models II#

Re-load the dataset again, this time without deleting any variables. Specify and evaluate a model that tests the following hypothesis:

  • x1,x2 and x3 should load onto visual, x4,x5 and x6 shoud load onto text, x7,x8 and x9 should load onto speed.

  • visual and text load onto a higher level factor called intelligence.

  • intelligence explains 100% of the covariance between visual and text.

  • intelligence predicts speed.

# Load the dataset
data2 = semopy.examples.holzinger39.get_data()

# Specify the model
desc5 = '''
    visual =~ x1 + x2 + x3
    text =~ x4 + x5 + x6
    speed =~ x7 + x8 + x9

    # Add a higher level factor
    intelligence =~ text + visual

    # No residual covariance between text and visual
    text ~~ 0*visual

    # Intelligence predicts speed
    speed ~ intelligence
'''

# Fit the model
model5 = semopy.Model(desc5)
results5 = model5.fit(data2)

# Get model estimates
estimates5 = model5.inspect()
print(estimates5)

# Get fit statistics
stats5 = semopy.calc_stats(model5)
print(stats5.T)

# Visualize the model
semopy.semplot(model5, 'figures/holzinger39_5.png', plot_covs=True)
            lval  op          rval  Estimate  Std. Err    z-value   p-value
0           text   ~  intelligence  1.000000         -          -         -
1         visual   ~  intelligence  1.511090  0.394682   3.828623  0.000129
2          speed   ~  intelligence  0.641936  0.150723   4.259032  0.000021
3             x1   ~        visual  1.000000         -          -         -
4             x2   ~        visual  0.554007   0.09969   5.557311       0.0
5             x3   ~        visual  0.730062  0.109142    6.68908       0.0
6             x4   ~          text  1.000000         -          -         -
7             x5   ~          text  1.112913  0.065402  17.016368       0.0
8             x6   ~          text  0.926028  0.055437  16.704137       0.0
9             x7   ~         speed  1.000000         -          -         -
10            x8   ~         speed  1.181997  0.165403   7.146174       0.0
11            x9   ~         speed  1.083819  0.151597   7.149338       0.0
12          text  ~~        visual  0.000000         -          -         -
13          text  ~~          text  0.709723  0.107058   6.629328       0.0
14  intelligence  ~~  intelligence  0.270079  0.090969    2.96892  0.002988
15         speed  ~~         speed  0.271288   0.06871    3.94832  0.000079
16        visual  ~~        visual  0.192062   0.17018   1.128585  0.259073
17            x1  ~~            x1  0.549578  0.113495   4.842322  0.000001
18            x2  ~~            x2  1.133480  0.101706  11.144668       0.0
19            x3  ~~            x3  0.843889  0.090617    9.31273       0.0
20            x4  ~~            x4  0.371101  0.047719   7.776749       0.0
21            x5  ~~            x5  0.446237  0.058388   7.642573       0.0
22            x6  ~~            x6  0.356280  0.043038   8.278273       0.0
23            x7  ~~            x7  0.799993  0.081361   9.832645       0.0
24            x8  ~~            x8  0.487875   0.07424   6.571633       0.0
25            x9  ~~            x9  0.565861  0.070778   7.994932       0.0
                      Value
DoF            2.400000e+01
DoF Baseline   3.600000e+01
chi2           8.530582e+01
chi2 p-value   8.501597e-09
chi2 Baseline  9.188516e+02
CFI            9.305593e-01
GFI            9.071604e-01
AGFI           8.607406e-01
NFI            9.071604e-01
TLI            8.958390e-01
RMSEA          9.227512e-02
AIC            4.143318e+01
BIC            1.192825e+02
LogLik         2.834080e-01
../../_images/8d2c88be3c84592998679a897b43753426900680748aa6413a25a7b6de31ebfb.svg