12.3 Exercises#

For the exercises we will use another mock-dataset. Assume the following scenario:

A research team tested a new therapy for reducing perceived stress over the course of 30 days. The generated data includes 200 stress level measures on a scale from 0 (no stress) to 100 (strong stress) at different time points after therapy start.

The data is stored in two columns:

  • days: The day of the measurement (0 to 30) after therapy start

  • stress: The perceived stress level (0–100)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ----------------------------------------
# Simulate "Stress Reduction Therapy" Data
# ---------------------------------------- 
np.random.seed(42)

# Create data
n_points = 200
days = np.linspace(0, 30, n_points)
x_shifted = days - 15
true_stress = -((x_shifted**3) / 60 - x_shifted) + 50

# Add random noise and ensure scores between 0 and 100
noise = np.random.normal(0, 6.0, size=n_points)
stress = true_stress + noise
stress = np.clip(stress, 0, 100)

# Plot the data
fig, ax = plt.subplots()
sns.scatterplot(x=days, y=stress, alpha=0.6, ax=ax)
ax.set(xlabel='Days', ylabel='Stress', title='Stress Reduction Therapy');
../../../_images/1431027df6538fee3b491be7431f18c05fc37a59b26da536aa97512e674cf79a.png

Exercise 1: Fitting a linear model#

Please fit a simple linear regression model to predict stress based on days. Please use the syntax as introduced in the previous chapters using PolynomialFeatures so it will be easy to explore higher-order models later.

You tasks are:

  1. Create features for a linear regression using PolynomialFeatures

  2. Fit a linear model

  3. Print and discuss the model summary

  4. Plot the regression line on top of the data

  5. Create a second plot for the residuals (either as a subplot or as two separate plots). Answer the following questions:

    • Are there patterns in the residuals that suggest the model is missing curvature?

    • Does the residual plot show systematic deviations (e.g., stress is consistently under/over-predicted at certain times)?

import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures

# TODO: Exercise 1

Exercise 2: Improving the model#

Do you think the linear model from the previous exercise can be improved? Which kind of polynomial might be suitable for the present data? Copy your code from the previous exercise and explore polynomials of different orders.

# TODO: Exercise 2

Voluntary exercise 1: Centering predictors#

Please center the TimeDays predictor and then fit a cubic model to the data. Print the model summary and discuss how the interpretation of the coefficients changes. Do you think it makes sense to center the predictr in this example?

# TODO: Voluntary exercise 1

Voluntary exercise 2: Data handling#

Now study the structure of study_time_p2. Instead of using the package from sklearn.preprocessing, PolynomialFeatures, find a way to manually create a matrix of shape (500, 3) with the same information as study_time_p2. Fit the new model using your manual_study_time_p2 and plot it in a new subplot next to the previous model. Are the two model comparable? Did anything change?

# ----------------------------------------
# Simulate "Study-grade" Data
# ---------------------------------------- 
np.random.seed(69)
study_time = np.linspace(1, 10, 500)
h = 6
k = 80
grades = -(k / (h**2)) * (study_time - h)**2 + k + np.random.normal(0, 8, study_time.shape)
grades = np.clip(grades, 0, 100) # ensure we only have grades between 0 and 100

# Create a DataFrame
df = pd.DataFrame({"study_time": study_time, "grades": grades})

# TODO: Voluntary exercise 2