13.3 Exercises

For the exercises, we will continue with the same data as in the previous sections. Again, we want to predict wage from age in the Mid-Atlantic Wage Dataset.

# Uncomment the following line when working in Colab
# !pip install ISLP  
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import patsy
import statsmodels.api as sm
from ISLP import load_data

df = load_data('Wage')
print(df.head())
   year  age            maritl      race        education              region  \
0  2006   18  1. Never Married  1. White     1. < HS Grad  2. Middle Atlantic   
1  2004   24  1. Never Married  1. White  4. College Grad  2. Middle Atlantic   
2  2003   45        2. Married  1. White  3. Some College  2. Middle Atlantic   
3  2003   43        2. Married  3. Asian  4. College Grad  2. Middle Atlantic   
4  2005   50       4. Divorced  1. White       2. HS Grad  2. Middle Atlantic   

         jobclass          health health_ins   logwage        wage  
0   1. Industrial       1. <=Good      2. No  4.318063   75.043154  
1  2. Information  2. >=Very Good      2. No  4.255273   70.476020  
2   1. Industrial       1. <=Good     1. Yes  4.875061  130.982177  
3  2. Information  2. >=Very Good     1. Yes  5.041393  154.685293  
4  2. Information       1. <=Good     1. Yes  4.318063   75.043154  

Exercise 1: Custom cut points

  • Create a regression model that predicts wage from age using step functions. Instead of separating the data into four evenly sized bins, this time create custom cut points at ages 30, 40, 50, 60, and 70.

  • Print and interpret the model summary

  • Plot the model
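As a small illustration of how custom cut points turn ages into bins, the sketch below uses `pd.cut` on a handful of made-up ages (not the Wage data). Using pandas for the binning, the open-ended outer edges, and the left-closed intervals (`right=False`) are all assumptions made for this illustration; the approach from the previous sections may differ.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only (assumption: not taken from the Wage data)
ages = pd.Series([18, 30, 35, 47, 58, 66, 80], name="age")

# Custom cut points from the exercise, padded with open-ended outer edges
cut_points = [30, 40, 50, 60, 70]
edges = [-np.inf] + cut_points + [np.inf]

# right=False gives left-closed bins such as [30, 40), so age 30 falls into the second bin
bins = pd.cut(ages, bins=edges, right=False)

# One indicator column per bin; this is the kind of design matrix a step function uses
dummies = pd.get_dummies(bins)

print(bins)
print(dummies.sum())
```

Five cut points produce six bins, and every observation belongs to exactly one of them.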

# TODO: Exercise 1

Exercise 2: Higher-order splines

  1. Please fit and plot a second-order spline regression model with cut points at age 30, 50, and 70

  2. Please fit and plot a third-order spline regression model with cut points at age 30, 50, and 70. Discuss the differences between the models. Does the third-order term significantly improve the model?

# TODO: Exercise 2.1
# TODO: Exercise 2.2

Voluntary exercise 1: Choosing the best model

Now that we have fitted multiple models, one question remains: which one do we choose? To decide, we will compare the models by their AIC, where a lower value indicates a better trade-off between fit and complexity (note that other measures exist as well). Feel free to use whatever cut points you like.

Hints: Try using a loop to achieve this!

  • You can create a list of formulas (e.g., bs(age, knots=(20,40,60,80), degree=0), bs(age, knots=(20,40,60,80), degree=1), …) and iterate over them.

  • You can use the .aic attribute of a fitted model to extract the AIC.

# TODO: Voluntary exercise 1

Voluntary exercise 2: Dynamic plotting

You might have noticed that manually changing cut point values can be tedious. This is called hardcoding, where fixed values are embedded directly in the script. A better approach is softcoding, where values are parameterized, allowing for easy updates without altering the code itself. You will also learn more about this in your MATLAB course.

  • Grab a model of your choice and modify the code so that the cut points are marked by purple, dotted, vertical lines.

  • Try to make your code dynamic, meaning it dynamically accepts cut points provided in the cut_points list and automatically fits and plots the model accordingly.

  • Add a legend to the plot and label all lines and data.

# TODO: Voluntary exercise 2