Path Modelling

In this exercise, we will revisit the Cleveland Heart Disease dataset, which we already explored in the categorical regression session. This dataset is widely used in medical research and machine learning for predicting heart disease. It includes data from patients with suspected heart conditions and features a variety of clinical and demographic attributes.

As before, we first load the dataset, combine the features (predictors) and targets into a single DataFrame, and then take a first look at it:

import pandas as pd
import semopy
from semopy import calc_stats
from ucimlrepo import fetch_ucirepo
  
# Fetch dataset 
heart_disease = fetch_ucirepo(id=45) 
  
# Get data (they already are DataFrames) 
X = heart_disease.data.features 
y = heart_disease.data.targets 

# Create a combined DataFrame
df = pd.concat([X, y], axis=1)

print(df.describe())
print(df.head())
              age         sex          cp    trestbps        chol         fbs  \
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000   
mean    54.438944    0.679868    3.158416  131.689769  246.693069    0.148515   
std      9.038662    0.467299    0.960126   17.599748   51.776918    0.356198   
min     29.000000    0.000000    1.000000   94.000000  126.000000    0.000000   
25%     48.000000    0.000000    3.000000  120.000000  211.000000    0.000000   
50%     56.000000    1.000000    3.000000  130.000000  241.000000    0.000000   
75%     61.000000    1.000000    4.000000  140.000000  275.000000    0.000000   
max     77.000000    1.000000    4.000000  200.000000  564.000000    1.000000   

          restecg     thalach       exang     oldpeak       slope          ca  \
count  303.000000  303.000000  303.000000  303.000000  303.000000  299.000000   
mean     0.990099  149.607261    0.326733    1.039604    1.600660    0.672241   
std      0.994971   22.875003    0.469794    1.161075    0.616226    0.937438   
min      0.000000   71.000000    0.000000    0.000000    1.000000    0.000000   
25%      0.000000  133.500000    0.000000    0.000000    1.000000    0.000000   
50%      1.000000  153.000000    0.000000    0.800000    2.000000    0.000000   
75%      2.000000  166.000000    1.000000    1.600000    2.000000    1.000000   
max      2.000000  202.000000    1.000000    6.200000    3.000000    3.000000   

             thal         num  
count  301.000000  303.000000  
mean     4.734219    0.937294  
std      1.939706    1.228536  
min      3.000000    0.000000  
25%      3.000000    0.000000  
50%      3.000000    0.000000  
75%      7.000000    2.000000  
max      7.000000    4.000000  
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   1       145   233    1        2      150      0      2.3      3   
1   67    1   4       160   286    0        2      108      1      1.5      2   
2   67    1   4       120   229    0        2      129      1      2.6      2   
3   37    1   3       130   250    0        0      187      0      3.5      3   
4   41    0   2       130   204    0        2      172      0      1.4      1   

    ca  thal  num  
0  0.0   6.0    0  
1  3.0   3.0    2  
2  2.0   7.0    1  
3  0.0   3.0    0  
4  0.0   3.0    0  

Exercise 1: Path Modelling

For this exercise, you will investigate the following hypotheses:

  1. Age (age) directly affects heart disease presence (num).

  2. Cholesterol (chol) and resting blood pressure (trestbps) mediate the relationship between age and num.

  3. Max heart rate achieved (thalach) and exercise-induced angina (exang) have direct effects on heart disease (num).

Additional information: The num variable is an ordinal categorical variable with values from 0 to 4, where 0 indicates no heart disease and 1 to 4 indicate the presence of heart disease with increasing severity.
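
Because hypothesis 1 is phrased in terms of disease presence, one option is to collapse num into a binary indicator before modelling. The snippet below is a minimal sketch of that recoding and is not part of the exercise solution. The column name `num_binary` is a hypothetical choice, and the five values are the first five rows of num from the `df.head()` output above.

```python
import pandas as pd

# First five values of `num`, copied from the df.head() output above
num = pd.Series([0, 2, 1, 0, 0], name="num")

# 0 stays 0 (no disease); 1 to 4 become 1 (disease present).
# `num_binary` is a hypothetical column name used only for illustration.
num_binary = (num > 0).astype(int).rename("num_binary")

print(pd.concat([num, num_binary], axis=1))
```

The solution further down keeps num on its original 0 to 4 scale and treats it as a numeric outcome.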

Your tasks therefore are:

  1. Create and fit a path model for the stated hypotheses.

  2. Print and interpret the relevant results.

  3. Create the path diagram and check if you did everything correctly.

# Define and fit the model
model = semopy.Model("""
                     num ~ age + chol + trestbps + thalach + exang
                     trestbps ~ age 
                     chol ~ age
                     """)

# Print information about the fitting process
info = model.fit(df)
print(info)

# Print the model estimates
estimates = model.inspect(std_est=True)
print(estimates)

# Print the model fit statistics
stats = calc_stats(model)
print(stats)

# Show and save the model figure
semopy.semplot(model, "figures/heart_disease_model.png", std_ests=True)
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.283
Number of iterations: 32
Params: 0.006 0.001 0.008 -0.015 0.720 0.554 1.197 1353.233 1.116 290.727
       lval  op      rval     Estimate  Est. Std    Std. Err    z-value  \
0  trestbps   ~       age     0.554256  0.281469    0.108551   5.105937   
1      chol   ~       age     1.196650  0.281656    0.234196   5.109621   
2       num   ~       age     0.006412  0.047124    0.007846   0.817216   
3       num   ~      chol     0.000686  0.021428    0.001650   0.415975   
4       num   ~  trestbps     0.007657  0.110819    0.003559   2.151459   
5       num   ~   thalach    -0.015432 -0.287044    0.003117  -4.951595   
6       num   ~     exang     0.719554  0.274881    0.140075   5.136928   
7      chol  ~~      chol  1353.232783  0.920670  109.942648  12.308534   
8  trestbps  ~~  trestbps   290.727227  0.920775   23.619973  12.308534   
9       num  ~~       num     1.115768  0.740213    0.090650  12.308534   

        p-value  
0  3.291591e-07  
1  3.228059e-07  
2  4.138047e-01  
3  6.774283e-01  
4  3.144000e-02  
5  7.360750e-07  
6  2.792666e-07  
7  0.000000e+00  
8  0.000000e+00  
9  0.000000e+00  
       DoF  DoF Baseline       chi2  chi2 p-value  chi2 Baseline       CFI  \
Value   11            18  85.692646  1.157963e-13     326.896179  0.758195   

           GFI      AGFI      NFI       TLI     RMSEA        AIC      BIC  \
Value  0.73786  0.571043  0.73786  0.604319  0.149947  19.434372  56.5717   

         LogLik  
Value  0.282814  
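
To help read the fit statistics above, the sketch below compares three of the printed indices against commonly cited rules of thumb (CFI and TLI of at least 0.95, RMSEA of at most 0.06). These cutoffs are an assumption brought in for illustration and are not part of the exercise. The index values are copied from the output above.

```python
# Fit indices copied from the calc_stats output above
fit = {"CFI": 0.758195, "TLI": 0.604319, "RMSEA": 0.149947}

# Rule-of-thumb cutoffs (assumed for illustration, not from the exercise).
# "min" means the index should be at least the threshold, "max" at most.
cutoffs = {
    "CFI": (0.95, "min"),
    "TLI": (0.95, "min"),
    "RMSEA": (0.06, "max"),
}

results = {}
for name, value in fit.items():
    threshold, kind = cutoffs[name]
    ok = value >= threshold if kind == "min" else value <= threshold
    results[name] = ok
    relation = ">=" if kind == "min" else "<="
    verdict = "ok" if ok else "poor"
    print(f"{name}: {value:.3f} (want {relation} {threshold}) -> {verdict}")
```

All three indices miss these cutoffs. Together with the significant chi-square test, this suggests that the hypothesized paths do not reproduce the observed covariances well.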
[Figure: path diagram of the fitted model with standardized estimates]
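
Hypothesis 2 concerns mediation, which the estimates table does not report directly. A minimal sketch of the usual product-of-coefficients calculation follows, using the unstandardized estimates copied from the output above. The variable names are illustrative, and the sketch gives point estimates only, without standard errors or significance tests for the indirect effects.

```python
# Unstandardized path estimates copied from the model.inspect() output above
a_trestbps = 0.554256  # trestbps ~ age
b_trestbps = 0.007657  # num ~ trestbps
a_chol = 1.196650      # chol ~ age
b_chol = 0.000686      # num ~ chol
direct = 0.006412      # num ~ age

# Indirect effect through each mediator: product of the two paths
indirect_trestbps = a_trestbps * b_trestbps
indirect_chol = a_chol * b_chol

# Total effect of age on num: direct effect plus both indirect effects
total = direct + indirect_trestbps + indirect_chol

print(f"Indirect via trestbps: {indirect_trestbps:.6f}")
print(f"Indirect via chol:     {indirect_chol:.6f}")
print(f"Direct:                {direct:.6f}")
print(f"Total:                 {total:.6f}")
```

Read alongside the p-values above, the direct path from age to num (p ≈ 0.41) and the path from chol to num (p ≈ 0.68) are not significant, whereas both age paths to the mediators and the path from trestbps to num (p ≈ 0.03) are. On this evidence only the blood pressure route offers some support for mediation, and a formal test would still need standard errors for the products.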

Exercise 2: Quiz

from jupyterquiz import display_quiz

display_quiz('https://raw.githubusercontent.com/mibur1/psy111/main/book/solutions/quiz/question3.json')
display_quiz('https://raw.githubusercontent.com/mibur1/psy111/main/book/solutions/quiz/question4.json')