13.3 Exercises
For the exercises, we will continue with the same data as in the previous sections. Again, we want to predict wage from age in the Mid-Atlantic Wage Dataset.
# Uncomment the following line when working in Colab
# !pip install ISLP
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import patsy
import statsmodels.api as sm
from ISLP import load_data
df = load_data('Wage')
print(df.head())
year age maritl race education region \
0 2006 18 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic
1 2004 24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
2 2003 45 2. Married 1. White 3. Some College 2. Middle Atlantic
3 2003 43 2. Married 3. Asian 4. College Grad 2. Middle Atlantic
4 2005 50 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic
jobclass health health_ins logwage wage
0 1. Industrial 1. <=Good 2. No 4.318063 75.043154
1 2. Information 2. >=Very Good 2. No 4.255273 70.476020
2 1. Industrial 1. <=Good 1. Yes 4.875061 130.982177
3 2. Information 2. >=Very Good 1. Yes 5.041393 154.685293
4 2. Information 1. <=Good 1. Yes 4.318063 75.043154
Exercise 1: Custom cut points
Create a regression model to predict wage from age by using stepwise functions. However, instead of separating the data into 4 evenly sized bins, this time create custom cut points at ages 30, 40, 50, 60, and 70.
Print and interpret the model summary.
Plot the model.
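As a starting point for the binning step only, custom cut points can be turned into bins roughly as in the sketch below. It runs on synthetic ages rather than the Wage data, and the `toy` frame, the outer edges 17 and 80, and all variable names are assumptions made for illustration; fitting and plotting the model are left to the exercise.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the age column (assumption, for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"age": rng.integers(18, 81, size=200)})

# Custom cut points; the outer edges 17 and 80 are assumed to cover all ages
edges = [17, 30, 40, 50, 60, 70, 80]
toy["age_bin"] = pd.cut(toy["age"], bins=edges)

# One indicator column per bin, usable as predictors in a regression
dummies = pd.get_dummies(toy["age_bin"], dtype=float)

# Check how many observations fall into each bin
print(toy["age_bin"].value_counts().sort_index())
```

By default `pd.cut` builds right-closed intervals such as (30, 40], so an age of exactly 40 falls into the lower bin. Any age outside the outer edges would become missing, which is why the edges should span the full observed range.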
# TODO: Exercise 1
Exercise 2: Higher-order splines
Please fit and plot a second-order spline regression model with cut points at age 30, 50, and 70.
Please fit and plot a third-order spline regression model with cut points at age 30, 50, and 70. Discuss the differences between the models. Does the third-order term significantly improve the model?
# TODO: Exercise 2.1
# TODO: Exercise 2.2
Voluntary exercise 1: Choosing the best model
Now that we have fitted multiple models, one question remains: which one do we choose? To decide, we will use the Akaike information criterion (AIC), where a lower value indicates a better trade-off between fit and model complexity (note that there are also other measures). Feel free to use whatever cut points you like.
Hints: Try using a loop to achieve this!
You can create a list of formulas (e.g., bs(age, knots=(20,40,60,80), degree=0), bs(age, knots=(20,40,60,80), degree=1), …) and iterate over them.
You can use the .aic attribute of a fitted model to extract the AIC.
# TODO: Voluntary exercise 1
Voluntary exercise 2: Dynamic plotting
You might have noticed that manually changing cut point values can be tedious. This is called hardcoding, where fixed values are embedded directly in the script. A better approach is softcoding, where values are parameterized, allowing for easy updates without altering the code itself. You will also learn more about this in your MATLAB course.
Grab a model of your choice and modify the code so that the cut points are marked by purple, dotted, vertical lines.
Try to make your code dynamic, meaning it accepts the cut points provided in the cut_points list and automatically fits and plots the model accordingly.
Add a legend to the plot and label all lines and data.
# TODO: Voluntary exercise 2