Path Modelling
In this exercise, we will revisit the Cleveland Heart Disease dataset, which we already explored in the categorical regression session. This dataset is widely used in medical research and machine learning for predicting heart disease. It includes data from patients with suspected heart conditions and features a variety of clinical and demographic attributes.
As before, we first load the dataset, combine the features (predictors) and targets into a single DataFrame, and then inspect it:
import pandas as pd
import semopy
from semopy import calc_stats
from ucimlrepo import fetch_ucirepo
# Fetch dataset
heart_disease = fetch_ucirepo(id=45)
# Get data (they already are DataFrames)
X = heart_disease.data.features
y = heart_disease.data.targets
# Create a combined DataFrame
df = pd.concat([X, y], axis=1)
print(df.describe())
print(df.head())
age sex cp trestbps chol fbs \
count 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000
mean 54.438944 0.679868 3.158416 131.689769 246.693069 0.148515
std 9.038662 0.467299 0.960126 17.599748 51.776918 0.356198
min 29.000000 0.000000 1.000000 94.000000 126.000000 0.000000
25% 48.000000 0.000000 3.000000 120.000000 211.000000 0.000000
50% 56.000000 1.000000 3.000000 130.000000 241.000000 0.000000
75% 61.000000 1.000000 4.000000 140.000000 275.000000 0.000000
max 77.000000 1.000000 4.000000 200.000000 564.000000 1.000000
restecg thalach exang oldpeak slope ca \
count 303.000000 303.000000 303.000000 303.000000 303.000000 299.000000
mean 0.990099 149.607261 0.326733 1.039604 1.600660 0.672241
std 0.994971 22.875003 0.469794 1.161075 0.616226 0.937438
min 0.000000 71.000000 0.000000 0.000000 1.000000 0.000000
25% 0.000000 133.500000 0.000000 0.000000 1.000000 0.000000
50% 1.000000 153.000000 0.000000 0.800000 2.000000 0.000000
75% 2.000000 166.000000 1.000000 1.600000 2.000000 1.000000
max 2.000000 202.000000 1.000000 6.200000 3.000000 3.000000
thal num
count 301.000000 303.000000
mean 4.734219 0.937294
std 1.939706 1.228536
min 3.000000 0.000000
25% 3.000000 0.000000
50% 3.000000 0.000000
75% 7.000000 2.000000
max 7.000000 4.000000
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope \
0 63 1 1 145 233 1 2 150 0 2.3 3
1 67 1 4 160 286 0 2 108 1 1.5 2
2 67 1 4 120 229 0 2 129 1 2.6 2
3 37 1 3 130 250 0 0 187 0 3.5 3
4 41 0 2 130 204 0 2 172 0 1.4 1
ca thal num
0 0.0 6.0 0
1 3.0 3.0 2
2 2.0 7.0 1
3 0.0 3.0 0
4 0.0 3.0 0
Exercise 1: Path Modelling
For this exercise, you will investigate the following hypotheses:
age directly affects heart disease presence (num).
Cholesterol (chol) and resting blood pressure (trestbps) mediate the relationship between age and num.
Max heart rate achieved (thalach) and exercise-induced angina (exang) have direct effects on heart disease (num).
Additional information: The num variable is a categorical variable with values ranging from 0 to 4, where 0 indicates no heart disease and 1 to 4 indicate the presence of heart disease with increasing severity.
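Because 0 means no disease and 1 to 4 mean disease of increasing severity, the presence/absence distinction can be made explicit with a derived column. The following is only a minimal sketch on a few made-up values, not rows from the real dataset; the column name disease_present is a hypothetical choice and not part of the data.

```python
import pandas as pd

# Toy values for illustration only (not rows from the real dataset)
toy = pd.DataFrame({"num": [0, 2, 1, 0, 4, 3]})

# Hypothetical derived column: 1 if any heart disease is present, else 0
toy["disease_present"] = (toy["num"] > 0).astype(int)

# Count how often each severity level occurs
severity_counts = toy["num"].value_counts().sort_index()

print(toy)
print(severity_counts)
```

The same two lines could be applied to the num column of df to check how the severity levels are distributed before modelling.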
Your tasks therefore are:
Create and fit a path model for the stated hypotheses.
Print and interpret the relevant results.
Create the path diagram and check if you did everything correctly.
# Define and fit the model
model = semopy.Model("""
num ~ age + chol + trestbps + thalach + exang
trestbps ~ age
chol ~ age
""")
# Print information about the fitting process
info = model.fit(df)
print(info)
# Print the model estimates
estimates = model.inspect(std_est=True)
print(estimates)
# Print the model fit statistics
stats = calc_stats(model)
print(stats)
# Show and save the model figure
semopy.semplot(model, "figures/heart_disease_model.png", std_ests=True)
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.283
Number of iterations: 32
Params: 0.006 0.001 0.008 -0.015 0.720 0.554 1.197 1353.233 1.116 290.727
lval op rval Estimate Est. Std Std. Err z-value \
0 trestbps ~ age 0.554256 0.281469 0.108551 5.105937
1 chol ~ age 1.196650 0.281656 0.234196 5.109621
2 num ~ age 0.006412 0.047124 0.007846 0.817216
3 num ~ chol 0.000686 0.021428 0.001650 0.415975
4 num ~ trestbps 0.007657 0.110819 0.003559 2.151459
5 num ~ thalach -0.015432 -0.287044 0.003117 -4.951595
6 num ~ exang 0.719554 0.274881 0.140075 5.136928
7 chol ~~ chol 1353.232783 0.920670 109.942648 12.308534
8 trestbps ~~ trestbps 290.727227 0.920775 23.619973 12.308534
9 num ~~ num 1.115768 0.740213 0.090650 12.308534
p-value
0 3.291591e-07
1 3.228059e-07
2 4.138047e-01
3 6.774283e-01
4 3.144000e-02
5 7.360750e-07
6 2.792666e-07
7 0.000000e+00
8 0.000000e+00
9 0.000000e+00
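For the mediation hypothesis, an indirect effect is commonly summarised as the product of the coefficients along a path (age to mediator, mediator to num), and the total effect as the direct effect plus all indirect effects. The following is a minimal sketch of that arithmetic, using the unstandardized estimates copied from the table above. The helper path_coef and the variable names are hypothetical, and this product-of-coefficients summary provides no standard errors or significance tests for the indirect effects.

```python
import pandas as pd

# Unstandardized regression estimates copied from the printed output
# (same column names as the table returned by model.inspect())
estimates = pd.DataFrame({
    "lval": ["trestbps", "chol", "num", "num", "num"],
    "op": ["~"] * 5,
    "rval": ["age", "age", "age", "chol", "trestbps"],
    "Estimate": [0.554256, 1.196650, 0.006412, 0.000686, 0.007657],
})


def path_coef(table, outcome, predictor):
    """Hypothetical helper: coefficient of the path `outcome ~ predictor`."""
    mask = (
        (table["lval"] == outcome)
        & (table["op"] == "~")
        & (table["rval"] == predictor)
    )
    return float(table.loc[mask, "Estimate"].iloc[0])


# Direct effect of age on num
direct = path_coef(estimates, "num", "age")

# Indirect effects: product of the two coefficients along each mediated path
indirect_trestbps = (
    path_coef(estimates, "trestbps", "age") * path_coef(estimates, "num", "trestbps")
)
indirect_chol = (
    path_coef(estimates, "chol", "age") * path_coef(estimates, "num", "chol")
)

# Total effect: direct effect plus both indirect effects
total = direct + indirect_trestbps + indirect_chol

print(f"direct:                {direct:.6f}")
print(f"indirect via trestbps: {indirect_trestbps:.6f}")
print(f"indirect via chol:     {indirect_chol:.6f}")
print(f"total:                 {total:.6f}")
```

When reading these numbers, keep in mind that the table reports non-significant estimates for num ~ age (p ≈ 0.41) and num ~ chol (p ≈ 0.68), so the direct effect and the path through chol receive little support from these data.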
DoF DoF Baseline chi2 chi2 p-value chi2 Baseline CFI \
Value 11 18 85.692646 1.157963e-13 326.896179 0.758195
GFI AGFI NFI TLI RMSEA AIC BIC \
Value 0.73786 0.571043 0.73786 0.604319 0.149947 19.434372 56.5717
LogLik
Value 0.282814
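To interpret the fit statistics, it can help to reproduce them by hand from the chi-square values. The sketch below applies the standard textbook formulas for RMSEA, CFI, TLI and NFI to the values printed above, with n = 303 observations (the row count shown by df.describe()). That semopy computes the indices in exactly this way is an assumption, although the results agree with the printed output to the shown precision.

```python
import math

# Values copied from the printed fit statistics
chi2, dof = 85.692646, 11             # fitted model
chi2_base, dof_base = 326.896179, 18  # baseline model
n_obs = 303                           # number of rows in the dataset

# RMSEA: excess chi-square per degree of freedom, scaled by sample size
rmsea = math.sqrt(max(chi2 - dof, 0.0) / (dof * (n_obs - 1)))

# CFI: improvement in non-centrality relative to the baseline model
cfi = 1.0 - max(chi2 - dof, 0.0) / max(chi2_base - dof_base, chi2 - dof, 0.0)

# TLI: like CFI, but based on chi-square / DoF ratios
ratio_base = chi2_base / dof_base
ratio_model = chi2 / dof
tli = (ratio_base - ratio_model) / (ratio_base - 1.0)

# NFI: proportional reduction in chi-square relative to the baseline model
nfi = (chi2_base - chi2) / chi2_base

print(f"RMSEA: {rmsea:.6f}")
print(f"CFI:   {cfi:.6f}")
print(f"TLI:   {tli:.6f}")
print(f"NFI:   {nfi:.6f}")
```

By common rules of thumb (for example, CFI and TLI of roughly 0.95 or higher and RMSEA below about 0.06 to 0.08), these values, together with the significant chi-square test, suggest that the hypothesised model fits the data poorly.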
Exercise 2: Quiz
from jupyterquiz import display_quiz
display_quiz('https://raw.githubusercontent.com/mibur1/psy111/main/book/solutions/quiz/question3.json')
display_quiz('https://raw.githubusercontent.com/mibur1/psy111/main/book/solutions/quiz/question4.json')