9.3 Exercises
In this exercise, we will revisit the Cleveland Heart Disease dataset, which we already explored in the categorical regression session. This dataset is widely used in medical research and machine learning for predicting heart disease. It includes data from patients with suspected heart conditions and features a variety of clinical and demographic attributes.
As before, we first load the dataset, combine the features (predictors) and targets into a single DataFrame, and then inspect it:
# Uncomment the following lines if you are using Google Colab
#!pip install semopy
#!pip install ucimlrepo
#!pip install jupyterquiz
import pandas as pd
import semopy
from semopy import calc_stats
from ucimlrepo import fetch_ucirepo
# Fetch dataset
heart_disease = fetch_ucirepo(id=45)
# Get data (they already are DataFrames)
X = heart_disease.data.features
y = heart_disease.data.targets
# Create a combined DataFrame
df = pd.concat([X, y], axis=1)
print(df.describe())
print(df.head())
age sex cp trestbps chol fbs \
count 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000
mean 54.438944 0.679868 3.158416 131.689769 246.693069 0.148515
std 9.038662 0.467299 0.960126 17.599748 51.776918 0.356198
min 29.000000 0.000000 1.000000 94.000000 126.000000 0.000000
25% 48.000000 0.000000 3.000000 120.000000 211.000000 0.000000
50% 56.000000 1.000000 3.000000 130.000000 241.000000 0.000000
75% 61.000000 1.000000 4.000000 140.000000 275.000000 0.000000
max 77.000000 1.000000 4.000000 200.000000 564.000000 1.000000
restecg thalach exang oldpeak slope ca \
count 303.000000 303.000000 303.000000 303.000000 303.000000 299.000000
mean 0.990099 149.607261 0.326733 1.039604 1.600660 0.672241
std 0.994971 22.875003 0.469794 1.161075 0.616226 0.937438
min 0.000000 71.000000 0.000000 0.000000 1.000000 0.000000
25% 0.000000 133.500000 0.000000 0.000000 1.000000 0.000000
50% 1.000000 153.000000 0.000000 0.800000 2.000000 0.000000
75% 2.000000 166.000000 1.000000 1.600000 2.000000 1.000000
max 2.000000 202.000000 1.000000 6.200000 3.000000 3.000000
thal num
count 301.000000 303.000000
mean 4.734219 0.937294
std 1.939706 1.228536
min 3.000000 0.000000
25% 3.000000 0.000000
50% 3.000000 0.000000
75% 7.000000 2.000000
max 7.000000 4.000000
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope \
0 63 1 1 145 233 1 2 150 0 2.3 3
1 67 1 4 160 286 0 2 108 1 1.5 2
2 67 1 4 120 229 0 2 129 1 2.6 2
3 37 1 3 130 250 0 0 187 0 3.5 3
4 41 0 2 130 204 0 2 172 0 1.4 1
ca thal num
0 0.0 6.0 0
1 3.0 3.0 2
2 2.0 7.0 1
3 0.0 3.0 0
4 0.0 3.0 0
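The count row of the summary shows 299 values for ca and 301 for thal, while every other column has 303, so a few entries are missing. The following is a minimal sketch of how such gaps could be counted and dropped. It uses a small toy DataFrame with made-up values (the name toy and its contents are assumptions for illustration, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the heart disease DataFrame (hypothetical values)
toy = pd.DataFrame({
    "age": [63, 67, 67, 37, 41],
    "ca": [0.0, 3.0, np.nan, 0.0, 0.0],
    "thal": [6.0, 3.0, 7.0, np.nan, 3.0],
    "num": [0, 2, 1, 0, 0],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only rows without any missing value
toy_complete = toy.dropna()
print(len(toy), "rows before,", len(toy_complete), "rows after dropping")
```

The variables used in Exercise 1 (age, chol, trestbps, thalach, exang, num) all have 303 values, so this step only matters if ca or thal enter a model.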
Exercise 1: Path Modelling
For this exercise, you will investigate the following hypotheses:
age directly affects heart disease presence (num).
Cholesterol (chol) and resting blood pressure (trestbps) mediate the relationship between age and num.
Max heart rate achieved (thalach) and exercise-induced angina (exang) have direct effects on heart disease (num).
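Hypotheses like these are usually written as a set of regression equations in lavaan-style syntax (outcome ~ predictors), which is the format semopy expects. The sketch below shows one possible translation and lists the implied paths as a sanity check. The helper list_paths is a hypothetical convenience function written for this illustration, not part of semopy, and the specification is one reading of the hypotheses rather than the only valid one:

```python
# One possible model description: each line is "outcome ~ predictors"
model_desc = """
chol ~ age
trestbps ~ age
num ~ age + chol + trestbps + thalach + exang
"""

def list_paths(description):
    """Return (predictor, outcome) pairs for every regression line."""
    paths = []
    for line in description.strip().splitlines():
        outcome, predictors = line.split("~")
        for predictor in predictors.split("+"):
            paths.append((predictor.strip(), outcome.strip()))
    return paths

# Print every arrow the path diagram should contain
paths = list_paths(model_desc)
for predictor, outcome in paths:
    print(f"{predictor} -> {outcome}")
```

Such a string can be passed to semopy.Model, fitted with the model's fit method on the DataFrame, and summarised with inspect. Comparing the printed arrows with the final path diagram is a quick way to spot a missing or superfluous path.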
Additional information: The num variable is a categorical variable with values ranging from 0 to 4, where 0 indicates no heart disease and 1 to 4 indicate the presence of heart disease with increasing severity.
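Since 0 means no disease and 1 to 4 all mean disease, one common simplification is to recode num into a binary presence indicator. The exercise does not prescribe this, so treat the following as an optional sketch. The values and the column name disease are assumptions for illustration:

```python
import pandas as pd

# Hypothetical severity codes in the same 0-4 range as num
num = pd.Series([0, 2, 1, 0, 4, 3], name="num")

# 0 stays 0 (no disease); 1-4 collapse to 1 (disease present)
disease = (num > 0).astype(int).rename("disease")
print(disease.tolist())
```

Keeping the original 0 to 4 coding preserves the severity information, whereas the binary version only distinguishes presence from absence.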
Your tasks are therefore:
Create and fit a path model for the stated hypotheses.
Print and interpret the relevant results.
Create the path diagram and check whether the model matches the hypotheses.
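When interpreting a mediation model, the indirect effect is the product of the path into the mediator (a) and the path from the mediator to the outcome (b), and the total effect is the direct effect plus the indirect effect. The sketch below illustrates this with simulated data and ordinary least squares. The variables x, m, y, the coefficients 0.5, 0.4, 0.3, and the helper ols are all assumptions made for this illustration. It uses a single mediator and separate regressions, whereas the exercise has two mediators and semopy estimates all paths jointly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data with known path coefficients (hypothetical values)
x = rng.normal(size=n)                       # predictor
m = 0.5 * x + rng.normal(size=n)             # mediator, a = 0.5
y = 0.3 * x + 0.4 * m + rng.normal(size=n)   # outcome, direct = 0.3, b = 0.4

def ols(predictors, outcome):
    """Least-squares slopes; an intercept is fitted and then dropped."""
    design = np.column_stack([np.ones(len(outcome))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coefs[1:]

a = ols([x], m)[0]            # path x -> m
direct, b = ols([x, m], y)    # paths x -> y and m -> y
indirect = a * b              # effect of x on y through m
total = direct + indirect

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"direct = {direct:.3f}, indirect = {indirect:.3f}, total = {total:.3f}")
```

With two mediators, the same logic applies to each mediator separately, and the total effect adds both indirect effects to the direct one.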
# TODO: Exercise 1
Exercise 2: Quiz
from jupyterquiz import display_quiz
display_quiz('https://raw.githubusercontent.com/mibur1/psy111/main/book/solutions/quiz/question3.json')
display_quiz('https://raw.githubusercontent.com/mibur1/psy111/main/book/solutions/quiz/question4.json')