6.4 Exercises#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
from patsy.contrasts import Treatment
Exercise 1: Loading the Data#
For today’s exercise we will use the Cleveland Heart Disease dataset. It is a well-known dataset in the field of medical research and machine learning, particularly used for predicting heart disease. The dataset contains data collected from patients with suspected heart disease and includes various clinical and demographic attributes.
Please visit the documentation and familiarize yourself with the dataset.
Find the instructions for importing the data in Python. You can remove the printing of the metadata and variables, as this looks horrible in the notebook. Read it in the documentation instead.
Create a combined DataFrame that joins the features and the targets column-wise (axis=1): pd.concat([X, y], axis=1).
Please check that your data looks the way you expect. You can use functions like .describe() or .head().
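As a minimal sketch of the combining step, the snippet below uses two small made-up frames in place of the real features and targets. The values are invented for illustration; only the names X and y follow the exercise text.

```python
import pandas as pd

# Toy stand-ins for the features (X) and targets (y); values are made up.
X = pd.DataFrame({"age": [63, 67, 41], "cp": [1, 4, 2]})
y = pd.DataFrame({"num": [0, 2, 0]})

# axis=1 places the frames side by side, aligning rows on the index.
df = pd.concat([X, y], axis=1)

# Quick sanity checks on the combined frame.
print(df.head())
print(df.describe())
```

With the real data, the same two checks show whether every feature column and the num column arrived with the expected number of rows.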
# Exercise 1
Exercise 2: Visualizing the Data#
Plot age against the different types of chest pain (cp) using sns.boxplot(). Incorporate the diagnosis of heart disease (num) as the hue. Use plt.xlabel() and plt.ylabel() to label the x-axis with ‘Chest Pain Type (cp)’ and the y-axis with ‘Age’.
Additional information: The num variable is a categorical variable with values ranging from 0 to 4, where 0 indicates no heart disease and 1 to 4 indicate the presence of heart disease with increasing severity.
plt.figure(figsize=(10, 6))
sns.boxplot(x=?,y=?,hue=?,data=?)
plt.title('Age Distribution by Chest Pain Type')
plt.show()
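For orientation, here is one way the template might be filled in, shown on a small made-up frame called toy rather than on the real dataset. The frame and its values are assumptions for illustration; the column names age, cp and num come from the exercise.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data with the three columns the plot needs.
toy = pd.DataFrame({
    "age": [63, 67, 41, 56, 48, 37, 57, 52],
    "cp":  [1, 1, 2, 2, 3, 3, 4, 4],
    "num": [0, 2, 0, 1, 0, 3, 1, 4],
})
# Treat the diagnosis as categorical so each value gets its own hue.
toy["num"] = toy["num"].astype("category")

plt.figure(figsize=(10, 6))
ax = sns.boxplot(x="cp", y="age", hue="num", data=toy)
plt.xlabel("Chest Pain Type (cp)")
plt.ylabel("Age")
plt.title("Age Distribution by Chest Pain Type")
```

In a notebook the figure renders at the end of the cell; in a script, finish with plt.show() as in the template.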
Exercise 3: Dummy Coding#
Perform a regression with a categorical predictor, using age as the outcome variable and cp as the predictor. For this:
In the documentation, inspect the “Additional Variable Information” to find out about the different levels of chest pain
Convert chest pain into a categorical variable
Apply a dummy coding scheme with typical angina as the reference category
Discuss the following points:
What do the coefficients tell you about the relationship between age and different types of chest pain?
Considering the explained variance and significance, do the results suggest a relationship between chest pain type and age? Why or why not?
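The mechanics of dummy coding can be sketched on made-up data. The snippet assumes age is the outcome and cp the dummy-coded predictor with level 1 as the reference; the frame toy and its values are invented for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf
from patsy.contrasts import Treatment

# Made-up data: four chest pain levels with unequal group sizes.
toy = pd.DataFrame({
    "cp":  [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4],
    "age": [63, 67, 41, 56, 48, 37, 57, 52, 44, 62, 60, 58, 65, 54],
})
toy["cp"] = toy["cp"].astype("category")

# Treatment (dummy) coding with level 1 as the reference category.
fit = smf.ols("age ~ C(cp, Treatment(reference=1))", data=toy).fit()

# Intercept: mean age in the reference group.
# Other coefficients: difference of each group's mean from the reference mean.
print(fit.params)
print(fit.rsquared)
```

On this toy frame the intercept equals the mean age of the cp = 1 group, which is the property to look for when interpreting the coefficients on the real data.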
# Exercise 3
Exercise 4: Weighted Effects Coding#
Create a weighted effects coding contrast matrix for a model with age as the outcome variable and cp as the predictor. Use 1 (typical angina) as the reference category, as in the previous steps.
Perform linear regression using ols() from statsmodels with the weighted effects coding matrix.
Compare and interpret the results against the previous dummy coding approach, focusing specifically on the impact of using a weighted reference category versus an unweighted reference. How does the weighting affect the interpretation of the relationship between cp and age?
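A rough sketch of weighted effects coding on made-up data follows. It assumes the usual construction: each non-reference level gets its own column with a 1, and the reference row holds the negative ratio of that level's count to the reference count. The frame toy, the helper names wec and coded, and the choice to pass the codes to ols() as explicit columns are all illustrative assumptions, not the course's prescribed solution.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data with unequal group sizes, so the weighting matters.
toy = pd.DataFrame({
    "cp":  [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4],
    "age": [63, 67, 41, 56, 48, 37, 57, 52, 44, 62, 60, 58, 65, 54],
})

levels = [1, 2, 3, 4]
reference = 1
counts = toy["cp"].value_counts()
others = [lvl for lvl in levels if lvl != reference]

# Contrast matrix: one row per level, one column per non-reference level.
wec = pd.DataFrame(0.0, index=levels, columns=[f"wec_{lvl}" for lvl in others])
for lvl in others:
    wec.loc[lvl, f"wec_{lvl}"] = 1.0
    # Reference row: negative ratio of group sizes.
    wec.loc[reference, f"wec_{lvl}"] = -counts.loc[lvl] / counts.loc[reference]

# Look up each observation's contrast row and attach it to the data.
codes = wec.loc[toy["cp"].to_numpy()].reset_index(drop=True)
coded = pd.concat([toy, codes], axis=1)

fit = smf.ols("age ~ wec_2 + wec_3 + wec_4", data=coded).fit()
print(wec)
print(fit.params)
```

Under this coding the intercept is the overall sample mean of age, and each coefficient is the deviation of a group's mean from that sample mean, in contrast to dummy coding, where the intercept is the reference group's mean.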
# Exercise 4
Voluntary exercise 1#
With the model as previously designed in Exercise 2:
Manually create the contrast matrix and print it
Create and print the design matrix
Hint: You can create the design matrix from the contrast matrix, but you need to map each level in cp to the corresponding contrast row: cp_mapping = {level: idx for idx, level in enumerate(levels)}.
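The mapping in the hint can be sketched with plain NumPy. The example assumes a dummy (treatment) contrast matrix with level 1 as the reference and a short made-up cp vector; a different contrast matrix would slot into the same lookup.

```python
import numpy as np

# Made-up cp values and the ordered list of levels.
cp = np.array([1, 3, 2, 4, 1, 2])
levels = [1, 2, 3, 4]

# Dummy contrast matrix, reference = level 1:
# one row per level, one column per non-reference level.
contrast = np.array([
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

# Map each level to the index of its row in the contrast matrix.
cp_mapping = {level: idx for idx, level in enumerate(levels)}
rows = [cp_mapping[int(value)] for value in cp]

# Design matrix: an intercept column followed by each observation's contrast row.
design = np.column_stack([np.ones(len(cp)), contrast[rows]])
print(contrast)
print(design)
```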
# Voluntary exercise 1
Voluntary exercise 2#
Implement contrast coding on the heart disease data set. There are no constraints, feel free to explore any contrasts.
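As one possible starting point, the sketch below applies sum (unweighted effects) coding through the formula interface on made-up data. The choice of contrast and the frame toy are illustrative assumptions; any other contrast would serve the exercise equally well.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data with unequal group sizes.
toy = pd.DataFrame({
    "cp":  [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4],
    "age": [63, 67, 41, 56, 48, 37, 57, 52, 44, 62, 60, 58, 65, 54],
})

# Sum coding: the intercept is the unweighted mean of the group means,
# and each coefficient is a group's deviation from that mean.
fit = smf.ols("age ~ C(cp, Sum)", data=toy).fit()
print(fit.params)
```

Because the groups differ in size, this intercept differs from the overall sample mean of age, which is a useful point of comparison with weighted effects coding.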
# Voluntary exercise 2