Confirmatory Factor Analysis & Structural Equation Modelling#
Exercise 1: Preparation and data exploration#
Load the semopy
package and the HolzingerSwineford1939
dataset. If you want to keep your dataset small and organized, you can use .drop()
to remove the columns x1
, x4
and x7
.
# Uncomment the following line to run in Google Colab
# !pip install semopy
import semopy
# Import your data here
data = semopy.examples.holzinger39.get_data()
# Drop the unnecessary columns/variables
data = data.drop(columns=['x1', 'x4', 'x7'])
print(data)
# Alternative way to drop columns:
# data = data.drop(['x1', 'x4', 'x7'], axis = "columns")
# print(data)
id sex ageyr agemo school grade x2 x3 x5 x6 \
1 1 1 13 1 Pasteur 7.0 7.75 0.375 5.75 1.285714
2 2 2 13 7 Pasteur 7.0 5.25 2.125 3.00 1.285714
3 3 2 13 1 Pasteur 7.0 5.25 1.875 1.75 0.428571
4 4 1 13 2 Pasteur 7.0 7.75 3.000 4.50 2.428571
5 5 2 12 2 Pasteur 7.0 4.75 0.875 4.00 2.571429
.. ... ... ... ... ... ... ... ... ... ...
297 346 1 13 5 Grant-White 8.0 7.00 1.375 4.25 1.000000
298 347 2 14 10 Grant-White 8.0 6.00 1.625 4.00 1.000000
299 348 2 14 3 Grant-White 8.0 5.50 1.875 5.75 4.285714
300 349 1 14 2 Grant-White 8.0 6.75 0.500 4.50 2.000000
301 351 1 13 5 Grant-White NaN 6.00 3.375 5.75 3.142857
x8 x9
1 5.75 6.361111
2 6.25 7.916667
3 3.90 4.416667
4 5.30 4.861111
5 6.30 5.916667
.. ... ...
297 5.60 5.250000
298 6.05 6.083333
299 6.00 7.611111
300 6.20 4.388889
301 6.95 5.166667
[301 rows x 12 columns]
Exercise 2: Fit a CFA model#
Fit a CFA model with 3 latent variables. x2
and x3
should load onto visual
, x5
and x6
shoud load onto text
, x8
and x9
should load onto speed
. Assume all latent factors to be uncorrelated with each other. After Specifying the model, fit it and inspect the model estimates as well as the model fit measures.
# Specify the model
desc = '''
visual =~ x2 + x3
text =~ x5 + x6
speed =~ x8 + x9
# Set correlations to zero
speed ~~ 0 * visual
speed ~~ 0 * text
text ~~ 0 * visual
'''
# Fit the model
model = semopy.Model(desc)
results = model.fit(data)
print(results)
# Get the estimates
estimates = model.inspect()
print(estimates)
# Get the fit measures
stats = semopy.calc_stats(model)
print(stats.T)
# Visualize the model and factor correlations
semopy.semplot(model, 'figures/holzinger39.png', plot_covs=True)
WARNING:root:Fisher Information Matrix is not PD.Moore-Penrose inverse will be used instead of Cholesky decomposition. See 10.1109/TSP.2012.2208105.
WARNING:root:Fisher Information Matrix is not PD.Moore-Penrose inverse will be used instead of Cholesky decomposition. See 10.1109/TSP.2012.2208105.
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.243
Number of iterations: 16
Params: 0.646 0.753 0.644 0.710 1.347 0.699 0.683 0.984 0.313 0.432 0.312 0.721
lval op rval Estimate Std. Err z-value p-value
0 x2 ~ visual 1.000000 - - -
1 x3 ~ visual 0.645532 0.069086 9.343818 0.0
2 x5 ~ text 1.000000 - - -
3 x6 ~ text 0.753297 0.039054 19.288631 0.0
4 x8 ~ speed 1.000000 - - -
5 x9 ~ speed 0.644022 0.052423 12.285023 0.0
6 speed ~~ visual 0.000000 - - -
7 speed ~~ text 0.000000 - - -
8 speed ~~ speed 0.710144 0.055045 12.901184 0.0
9 text ~~ visual 0.000000 - - -
10 text ~~ text 1.346736 0.081896 16.444561 0.0
11 visual ~~ visual 0.698939 0.070463 9.919235 0.0
12 x2 ~~ x2 0.682726 0.067459 10.120596 0.0
13 x3 ~~ x3 0.983522 0.09242 10.641903 0.0
14 x5 ~~ x5 0.312987 0.057231 5.468815 0.0
15 x6 ~~ x6 0.432071 0.050901 8.488501 0.0
16 x8 ~~ x8 0.311893 0.044302 7.040144 0.0
17 x9 ~~ x9 0.720522 0.064109 11.238952 0.0
Value
DoF 9.000000e+00
DoF Baseline 1.500000e+01
chi2 7.308156e+01
chi2 p-value 3.776091e-12
chi2 Baseline 3.976807e+02
CFI 8.325456e-01
GFI 8.162306e-01
AGFI 6.937176e-01
NFI 8.162306e-01
TLI 7.209094e-01
RMSEA 1.540581e-01
AIC 2.351441e+01
BIC 6.799973e+01
LogLik 2.427959e-01
Exercise 3: Fit a SEM model#
Adapt your model from above to include a structural part, meaning a unidirectional association on the level of latent variables. Print the model estimates and the model fit statistics. Does the CFA or the SEM model provide better fit? Provide an explanation for your conclusion.
# Specify the model
desc2 = '''
# Measurement model
visual =~ x2 + x3
text =~ x5 + x6
speed =~ x8 + x9
# Structural model
visual ~ text
speed ~ text
visual~ speed
'''
# Fit the model
model2 = semopy.Model(desc2)
results2 = model2.fit(data)
print(results2)
# Get the estimates
estimates2 = model2.inspect()
print(estimates2)
# Get the fit measures
stats2 = semopy.calc_stats(model2)
print(stats2.T)
# Visualize the model
semopy.semplot(model2, 'figures/holzinger39_2.png')
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.037
Number of iterations: 30
Params: 1.356 1.114 1.742 0.117 0.526 0.139 0.245 0.911 0.232 1.049 0.663 0.749 0.066 0.759 0.219
lval op rval Estimate Std. Err z-value p-value
0 visual ~ text 0.116602 0.053476 2.180459 0.029223
1 visual ~ speed 0.526146 0.14389 3.656581 0.000256
2 speed ~ text 0.139384 0.047092 2.959813 0.003078
3 x2 ~ visual 1.000000 - - -
4 x3 ~ visual 1.355716 0.317519 4.269713 0.00002
5 x5 ~ text 1.000000 - - -
6 x6 ~ text 1.113639 0.211906 5.255348 0.0
7 x8 ~ speed 1.000000 - - -
8 x9 ~ speed 1.742380 0.425147 4.098305 0.000042
9 speed ~~ speed 0.244558 0.074144 3.298424 0.000972
10 text ~~ text 0.911104 0.205777 4.427619 0.00001
11 visual ~~ visual 0.232407 0.076277 3.046885 0.002312
12 x2 ~~ x2 1.049368 0.114437 9.169868 0.0
13 x3 ~~ x3 0.662589 0.149797 4.423247 0.00001
14 x5 ~~ x5 0.748985 0.177447 4.220883 0.000024
15 x6 ~~ x6 0.066350 0.206704 0.320992 0.748216
16 x8 ~~ x8 0.759281 0.085924 8.836609 0.0
17 x9 ~~ x9 0.218759 0.181822 1.203154 0.228917
Value
DoF 6.000000
DoF Baseline 15.000000
chi2 11.148178
chi2 p-value 0.083903
chi2 Baseline 397.680707
CFI 0.986547
GFI 0.971967
AGFI 0.929918
NFI 0.971967
TLI 0.966368
RMSEA 0.053480
AIC 29.925926
BIC 85.532580
LogLik 0.037037
Voluntary exercise 1: Higher level factors#
Go back to your CFA model and add a higher level factor onto which all latent variables load onto. Name it intelligence
. Does the higher level factor improve model fit?
# Specify the model
desc3 = '''
visual =~ x2 + x3
text =~ x5 + x6
speed =~ x8 + x9
# Higher level factor
intelligence =~ text + speed + visual
'''
# Fit the model
model3 = semopy.Model(desc3)
results3 = model3.fit(data)
print(results3)
# Print the model estimates
estimates3 = model3.inspect()
print(estimates3)
# Print the model fit measures
stats3 = semopy.calc_stats(model3)
print(stats3.T)
# Visualize the model
semopy.semplot(model3, 'figures/holzinger39_3.png')
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.037
Number of iterations: 39
Params: 1.363 1.116 1.721 0.897 1.203 0.143 0.151 0.766 0.124 1.051 0.659 0.751 0.064 0.757 0.227
lval op rval Estimate Std. Err z-value p-value
0 text ~ intelligence 1.000000 - - -
1 speed ~ intelligence 0.897471 0.351128 2.555968 0.010589
2 visual ~ intelligence 1.203470 0.444534 2.707259 0.006784
3 x2 ~ visual 1.000000 - - -
4 x3 ~ visual 1.363364 0.32002 4.260246 0.00002
5 x5 ~ text 1.000000 - - -
6 x6 ~ text 1.116107 0.213113 5.237161 0.0
7 x8 ~ speed 1.000000 - - -
8 x9 ~ speed 1.720615 0.415237 4.143696 0.000034
9 intelligence ~~ intelligence 0.142677 0.077422 1.842853 0.065351
10 speed ~~ speed 0.151134 0.050385 2.999596 0.002703
11 text ~~ text 0.766177 0.161477 4.744796 0.000002
12 visual ~~ visual 0.124396 0.076715 1.621541 0.104902
13 x2 ~~ x2 1.051047 0.114328 9.193273 0.0
14 x3 ~~ x3 0.659403 0.150627 4.37772 0.000012
15 x5 ~~ x5 0.750763 0.177671 4.225568 0.000024
16 x6 ~~ x6 0.064111 0.207847 0.308453 0.757738
17 x8 ~~ x8 0.756587 0.085921 8.8056 0.0
18 x9 ~~ x9 0.227420 0.178078 1.277079 0.201574
Value
DoF 6.000000
DoF Baseline 15.000000
chi2 11.147616
chi2 p-value 0.083920
chi2 Baseline 397.680707
CFI 0.986549
GFI 0.971968
AGFI 0.929921
NFI 0.971968
TLI 0.966371
RMSEA 0.053477
AIC 29.925929
BIC 85.532583
LogLik 0.037035
Voluntary exercise 2: Advanced models I#
Now go back to your SEM model and modify it in a way such that the factor variance of the speed
factor is fixed to 1. How does that affect the interpretation of the loading associated with that factor?
##solution
# Specify the model
desc4 = '''
# Measurement model
visual =~ x2 + x3
text =~ x5 + x6
speed =~ x8 + x9
# Structrual model
visual ~ text
speed ~ text
visual ~ speed
# Fix factor variance to 1
speed ~~ 1*speed
'''
# Fit the model
model4 = semopy.Model(desc4)
results4 = model4.fit(data)
# Print the model estimates
estimates4 = model4.inspect()
print(estimates4)
# Print the model fit measures
stats4 = semopy.calc_stats(model4)
print(stats4.T)
# Visualize the model
semopy.semplot(model4, 'figures/holzinger39_4.png')
lval op rval Estimate Std. Err z-value p-value
0 visual ~ text 0.183745 0.063006 2.916327 0.003542
1 visual ~ speed 0.144844 0.057291 2.528225 0.011464
2 speed ~ text 0.171449 0.06553 2.616357 0.008887
3 x2 ~ visual 1.000000 - - -
4 x3 ~ visual 1.248551 0.367263 3.399608 0.000675
5 x5 ~ text 1.000000 - - -
6 x6 ~ text 1.179038 0.254336 4.635755 0.000004
7 x8 ~ speed 1.000000 - - -
8 x9 ~ speed 0.493857 0.065675 7.519672 0.0
9 speed ~~ speed 1.000000 - - -
10 text ~~ text 0.860697 0.215062 4.002082 0.000063
11 visual ~~ visual 0.304534 0.108182 2.815018 0.004877
12 x2 ~~ x2 1.020511 0.134482 7.588483 0.0
13 x3 ~~ x3 0.711627 0.174637 4.074899 0.000046
14 x5 ~~ x5 0.799461 0.190872 4.188463 0.000028
15 x6 ~~ x6 0.000000 0.249393 0.0 1.0
16 x8 ~~ x8 0.091410 0.087437 1.045431 0.295824
17 x9 ~~ x9 0.784545 0.069018 11.367247 0.0
Value
DoF 7.000000e+00
DoF Baseline 1.500000e+01
chi2 4.121157e+01
chi2 p-value 7.372890e-07
chi2 Baseline 3.976807e+02
CFI 9.106002e-01
GFI 8.963702e-01
AGFI 7.779362e-01
NFI 8.963702e-01
TLI 8.084291e-01
RMSEA 1.276371e-01
AIC 2.772617e+01
BIC 7.962571e+01
LogLik 1.369155e-01
Voluntary exercise 3: Advanced models II#
Re-load the dataset again, this time without deleting any variables. Specify and evaluate a model that tests the following hypothesis:
x1
,x2
andx3
should load ontovisual
,x4
,x5
andx6
shoud load ontotext
,x7
,x8
andx9
should load ontospeed
.visual
andtext
load onto a higher level factor calledintelligence
.intelligence
explains 100% of the covariance betweenvisual
andtext
.intelligence
predictsspeed
.
# Load the dataset
data2 = semopy.examples.holzinger39.get_data()
# Specify the model
desc5 = '''
visual =~ x1 + x2 + x3
text =~ x4 + x5 + x6
speed =~ x7 + x8 + x9
# Add a higher level factor
intelligence =~ text + visual
# No residual covariance between text and visual
text ~~ 0*visual
# Intelligence predicts speed
speed ~ intelligence
'''
# Fit the model
model5 = semopy.Model(desc5)
results5 = model5.fit(data2)
# Get model estimates
estimates5 = model5.inspect()
print(estimates5)
# Get fit statistics
stats5 = semopy.calc_stats(model5)
print(stats5.T)
# Visualize the model
semopy.semplot(model5, 'figures/holzinger39_5.png', plot_covs=True)
lval op rval Estimate Std. Err z-value p-value
0 text ~ intelligence 1.000000 - - -
1 visual ~ intelligence 1.511090 0.394682 3.828623 0.000129
2 speed ~ intelligence 0.641936 0.150723 4.259032 0.000021
3 x1 ~ visual 1.000000 - - -
4 x2 ~ visual 0.554007 0.09969 5.557311 0.0
5 x3 ~ visual 0.730062 0.109142 6.68908 0.0
6 x4 ~ text 1.000000 - - -
7 x5 ~ text 1.112913 0.065402 17.016368 0.0
8 x6 ~ text 0.926028 0.055437 16.704137 0.0
9 x7 ~ speed 1.000000 - - -
10 x8 ~ speed 1.181997 0.165403 7.146174 0.0
11 x9 ~ speed 1.083819 0.151597 7.149338 0.0
12 text ~~ visual 0.000000 - - -
13 text ~~ text 0.709723 0.107058 6.629328 0.0
14 intelligence ~~ intelligence 0.270079 0.090969 2.96892 0.002988
15 speed ~~ speed 0.271288 0.06871 3.94832 0.000079
16 visual ~~ visual 0.192062 0.17018 1.128585 0.259073
17 x1 ~~ x1 0.549578 0.113495 4.842322 0.000001
18 x2 ~~ x2 1.133480 0.101706 11.144668 0.0
19 x3 ~~ x3 0.843889 0.090617 9.31273 0.0
20 x4 ~~ x4 0.371101 0.047719 7.776749 0.0
21 x5 ~~ x5 0.446237 0.058388 7.642573 0.0
22 x6 ~~ x6 0.356280 0.043038 8.278273 0.0
23 x7 ~~ x7 0.799993 0.081361 9.832645 0.0
24 x8 ~~ x8 0.487875 0.07424 6.571633 0.0
25 x9 ~~ x9 0.565861 0.070778 7.994932 0.0
Value
DoF 2.400000e+01
DoF Baseline 3.600000e+01
chi2 8.530582e+01
chi2 p-value 8.501597e-09
chi2 Baseline 9.188516e+02
CFI 9.305593e-01
GFI 9.071604e-01
AGFI 8.607406e-01
NFI 9.071604e-01
TLI 8.958390e-01
RMSEA 9.227512e-02
AIC 4.143318e+01
BIC 1.192825e+02
LogLik 2.834080e-01