6.4 Exercises

Data preparation

We will use the diabetes data set from the sklearn library for the following three exercises.

  1. Loading the data

    • The code for loading the dataset is already provided. Look at the documentation of the load_diabetes() function to familiarize yourself with the function and its outputs.

    • Use the DESCR attribute of the data set to get its description. Understand the variables, their meanings, and how they relate to the medical context (e.g., what each feature like BMI, age, and blood pressure represents in the context of diabetes).

  2. Preparing the data

    • Rename the target column of the DataFrame to diabetes by using the .rename() method of the DataFrame. Search for its documentation if you are unsure how to use it.

    • Print the DataFrame for visual inspection.
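As a reminder of how .rename() works, here is a minimal sketch on a made-up DataFrame. The column names feature, target, and outcome are placeholders chosen for illustration and are not part of the diabetes data. The sketch assumes you pass a dictionary that maps old column names to new ones via the columns argument:

```python
import pandas as pd

# Hypothetical toy data, only used to illustrate the method
toy = pd.DataFrame({"feature": [0.1, 0.2, 0.3], "target": [10, 20, 30]})

# .rename() returns a new DataFrame; the original is left unchanged
renamed = toy.rename(columns={"target": "outcome"})

print(renamed)
```

Because .rename() returns a new object by default, remember to assign the result back to a variable if you want to keep working with the renamed columns.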

Hints:

  • If you’re using a local Python installation, you will need to install some packages: pip install scikit-learn seaborn pingouin. Make sure to install them in the psy111 environment.

  • If you are using Google Colab, you only need to install the pingouin package. To do so, create a code cell at the top of the script and write: !pip install pingouin.

  • If you’re using a Jupyter Notebook, the printed description may be truncated. Select one of the options in “View as a scrollable element or open in a text editor” at the bottom of the output to see the entire description.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import pingouin as pg
from sklearn import datasets

# 1. Load the data
diabetes = datasets.load_diabetes(as_frame=True)
diabetes_df = diabetes.frame

# 2. Preparing the data

Exercise 1: Multiple linear regression

  1. Estimate a multiple linear regression model using bmi, bp, and s5 to predict diabetes progression (diabetes).

  2. Estimate a second model with different predictors.

  3. Compare the models. What are their R-squared values? What do these values tell you about the performance of the models?

Tip: When working in Jupyter Notebooks, it might make sense to put different models/computations in separate code cells, so they can be evaluated individually and the outputs are easier to read.
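To illustrate the general workflow without giving away the solution, the following sketch fits two models on synthetic data and compares their R-squared values. The variables x1, x2, and y and the coefficients used to generate them are made up for this example; only smf.ols(), .fit(), and the rsquared / rsquared_adj attributes come from statsmodels.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on both x1 and x2 plus random noise
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 2.0 * toy["x1"] - 1.0 * toy["x2"] + rng.normal(size=n)

# Model with one predictor vs. model with two predictors
small = smf.ols("y ~ x1", data=toy).fit()
full = smf.ols("y ~ x1 + x2", data=toy).fit()

# R-squared: proportion of variance in y explained by the model
print("small:", small.rsquared, small.rsquared_adj)
print("full: ", full.rsquared, full.rsquared_adj)
```

Adding a predictor to an existing ordinary least squares model never lowers R-squared, so the adjusted R-squared (which penalizes additional predictors) can be the fairer basis for comparison when models differ in size. Calling .summary() on a fitted model prints the full results table.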

# Exercise 1.1
# Exercise 1.2

Exercise 2: Correlations

  1. Correlations

    • Compute the correlation matrix for all columns in the data

    • Print the shape of the correlation matrix. What shape does it have, and why?

  2. Visualizing the data

    • Plot the correlation matrix by using seaborn’s heatmap() function. The plot should have the following features:

      • The cells in the plot should be annotated with the correlation values

      • The colormap should be "coolwarm" (it’s a good choice for correlation values)

      • The colorbar should range from -1 to 1 (because this is the full range correlation values can take)

      • The cells in the plot should be square (just because it looks nice)

      • The cells should be separated with lines of width 1 (also because it looks nice)

    • Through visual inspection, identify the variable that shows the highest correlation (positive or negative) with the target variable (diabetes progression)
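The steps above can be sketched on a small synthetic DataFrame. The columns a, b, and c are placeholders, and the data are random numbers, so the resulting values carry no meaning. The heatmap() arguments shown here (annot, cmap, vmin, vmax, square, linewidths) are one way to obtain the listed features:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with three made-up variables
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
toy["c"] = toy["a"] + 0.5 * toy["c"]  # make c correlate with a

# Pairwise Pearson correlations between all columns
corr = toy.corr()
print(corr.shape)  # (3, 3): one row and one column per variable

# Annotated heatmap with a diverging colormap fixed to the range [-1, 1]
fig, ax = plt.subplots()
sns.heatmap(
    corr,
    annot=True,
    cmap="coolwarm",
    vmin=-1,
    vmax=1,
    square=True,
    linewidths=1,
    ax=ax,
)
# In a notebook the figure is shown automatically; in a script, call plt.show()
```

The diagonal of a correlation matrix is always 1, because every variable correlates perfectly with itself, and the matrix is symmetric around that diagonal.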

# Exercise 2

Exercise 3: Partial Correlation

Age could be influencing both BMI and diabetes progression. As people age, both their BMI and risk for diabetes may increase, potentially inflating the observed correlation between BMI and diabetes progression. By holding age constant, we can test whether the relationship between BMI and diabetes progression persists independently of age.

Hypothesis:

  • Null Hypothesis (\(H_0\)): There is no relationship between BMI and diabetes progression after controlling for age.

  • Alternative Hypothesis (\(H_1\)): There is a relationship between BMI and diabetes progression, even after controlling for age.

Tasks:

  • Use the pingouin library to calculate the partial correlation between BMI and diabetes progression, controlling for age.

  • Compare the partial correlation coefficient to the original Pearson correlation coefficient. Did the correlation decrease after accounting for age? What does this suggest about age as a confounding factor?
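To build intuition for what "controlling for" a variable means, the sketch below computes a partial correlation by hand on synthetic data. Note that this is not the pingouin call the task asks for: it uses the standard first-order formula, which derives the partial correlation of x and y given z from the three pairwise Pearson correlations. The variables x, y, and z are made up, with z acting as a confounder that drives both x and y.

```python
import numpy as np
import pandas as pd

# Synthetic data: z influences both x and y; x and y are otherwise unrelated
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
toy = pd.DataFrame({
    "x": z + rng.normal(size=n),
    "y": z + rng.normal(size=n),
    "z": z,
})

# Pairwise Pearson correlations
r = toy.corr()
r_xy = r.loc["x", "y"]
r_xz = r.loc["x", "z"]
r_yz = r.loc["y", "z"]

# First-order partial correlation of x and y, controlling for z
partial = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print("Pearson r(x, y):        ", round(r_xy, 3))
print("Partial r(x, y | z):    ", round(partial, 3))
```

In this constructed example the plain correlation between x and y is clearly positive, while the partial correlation is close to zero, because the association is entirely due to z. In the exercise, pingouin's partial_corr() function performs the computation for you and additionally reports a p-value and confidence interval.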

# Exercise 3

Voluntary exercise 1: Data wrangling

We have previously loaded the diabetes dataset with the as_frame=True argument. If we do not specify this argument, the combined DataFrame will not be provided, but rather the data, target, and labels will be returned separately.

  1. Familiarize yourself with the returns of the load_diabetes() function.

  2. What kind of data types are the data, target, and labels?

  3. Manually create the joint DataFrame by combining the data, target, and labels.

  4. Verify that your operations were successful (e.g. by printing the joint DataFrame).

Hint: There are multiple ways of creating a joint DataFrame. Have a look at section 5.2 if you need a refresher. You could, for example, join two DataFrames/Series, or you could just add a new column to an existing DataFrame. Feel free to experiment! :)
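The two approaches mentioned in the hint can be sketched on tiny hand-made arrays. The arrays data and target and the names f1 and f2 are placeholders standing in for whatever load_diabetes() returns; check the actual types and shapes yourself before adapting this.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for a feature matrix, a target vector, and column names
data = np.array([[0.1, 1.0], [0.2, 2.0], [0.3, 3.0]])  # shape (n_samples, n_features)
target = np.array([10.0, 20.0, 30.0])                  # shape (n_samples,)
names = ["f1", "f2"]

# Option A: build a DataFrame from the features, then add the target as a new column
joint_a = pd.DataFrame(data, columns=names)
joint_a["target"] = target

# Option B: concatenate a DataFrame and a named Series column-wise
features = pd.DataFrame(data, columns=names)
target_series = pd.Series(target, name="target")
joint_b = pd.concat([features, target_series], axis=1)

print(joint_a)
print(joint_b)
```

Both options produce the same table here. Option B relies on the DataFrame and the Series sharing the same index, which is the case when both are created from arrays of equal length.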

# Voluntary exercise 1