Exercises

Exercise 1: Linear Regression Recap

Implement polynomial regression models using both statsmodels and scikit-learn. Which polynomial degree is best suited to the synthetic data generated in the code cell below?

Plot the resulting regression models.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate synthetic data
n_samples = 100
X = np.linspace(-2, 2, n_samples).reshape(-1, 1)
y = X**2 + np.random.normal(scale=0.5, size=X.shape)

# TODO: Statsmodels
...

# TODO: Scikit-learn
...

# TODO: Plot the predictions
...

Exercise 2: Bias-variance Tradeoff

To get a better understanding of the bias-variance tradeoff, we will fit polynomial regression models to noisy synthetic data generated from a known function, \(y=\sin(x)\).

Please perform the following tasks:

1. Visualize the data. Which model do you think would be optimal?
2. Split the data into a training set (70%) and a testing set (30%).
3. Fit polynomial regression models for degrees 1 to 15.
4. Plot the training and testing errors against the model degrees.

Hint: You can split the data with the train_test_split() function from sklearn.model_selection.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Generate synthetic data
np.random.seed(55)
n_samples = 100
X = np.linspace(0, 2*np.pi, n_samples).reshape(-1, 1)  # Reshape for sklearn
y = np.sin(X) + np.random.normal(scale=0.5, size=X.shape)

# 1. TODO: Visualize the data
...

# 2. TODO: Split the data into training and testing sets
...

# 3. TODO: Fit polynomial regression models for degrees 1 to 15
degrees = range(1, 16)
...

# 4. TODO: Plot the training and testing errors against polynomial degree
...
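
Steps 2 and 3 could be sketched roughly as follows. This is one possible approach under stated assumptions: `random_state=0` for the split is an arbitrary choice, the target is flattened to one dimension, and the polynomial features and the linear model are chained with `make_pipeline` for convenience.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data as in the starter cell, with a 1-D target
np.random.seed(55)
n_samples = 100
X = np.linspace(0, 2 * np.pi, n_samples).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(scale=0.5, size=n_samples)

# 70% training, 30% testing (random_state is an arbitrary assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

degrees = range(1, 16)
train_errors, test_errors = [], []
for degree in degrees:
    # Expand x into polynomial features, then fit ordinary least squares
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))

best_degree = degrees[int(np.argmin(test_errors))]
print("Degree with the lowest test MSE:", best_degree)
```

For step 4, the two lists can be plotted with `plt.plot(degrees, train_errors)` and `plt.plot(degrees, test_errors)`. The degree with the lowest test error depends on the particular split, so it is worth trying a few values of `random_state`.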

Exercise 3: Resampling Methods

The dataset for this exercise is the California Housing dataset. It contains 20640 samples and 8 features. The features describe the demography of each district (income, population, house occupancy), its location (latitude, longitude), and general information about its houses (number of rooms, number of bedrooms, age of the house). Since these statistics are aggregated at the district level, they correspond to averages or medians.

Familiarize yourself with the dataset by exploring the documentation and looking at the data. What are the features and target?

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)
df = data.frame
...

Let’s have a quick look at the distribution of these features by plotting their histograms.

# TODO: Plot histograms
...

1. Prepare the data for cross-validation. This will require you to have a variable (e.g. X) for the features and a variable (e.g. y) for the target.
2. Set up a k-fold cross-validation for a linear regression:
   - Choose an appropriate k
   - Define the model
   - Perform cross-validation
   - Use the mean squared error (MSE) to assess model performance

Hints:

For 1: You can achieve this by e.g. creating them from the DataFrame, or by using the return_X_y parameter of the fetch_california_housing() function.
For 2: You can use the LinearRegression() model from sklearn. You can then evaluate the model with cross_val_score(), specifying the (negative) MSE as the performance measure.

# TODO: Prepare data
...

from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression 
from sklearn.model_selection import cross_val_score 

# TODO: Implement CV
...
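
The cross-validation mechanics could look roughly like the sketch below. To keep it runnable without a download, it uses a synthetic stand-in dataset from `make_regression` with 8 features; this is an assumption for illustration only, and in the exercise `X` and `y` would come from the California Housing data. The choices `n_splits=5`, `shuffle=True`, and `random_state=0` are likewise arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data (assumption): replace with the housing features and target
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)

# k = 5 folds, shuffled so that the folds do not depend on row order
kf = KFold(n_splits=5, shuffle=True, random_state=0)
model = LinearRegression()

# scikit-learn maximizes scores, so the MSE is reported with a negative sign
neg_mse = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
mse_per_fold = -neg_mse

print("MSE per fold:", mse_per_fold)
print("Mean MSE:    ", mse_per_fold.mean())
```

Flipping the sign of the returned scores gives one ordinary MSE value per fold; their mean is the cross-validated estimate of the test error.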

Exercise 4: LOOCV

Use leave-one-out cross-validation (LOOCV) and compare the average MSE with the k-fold result from Exercise 3.
Get the minimum and maximum MSE values. Discuss the range!
Plot the MSE values in a histogram (the x-range should be from 0 to 6).
Calculate the median MSE and discuss whether it might be a more appropriate measure than the mean.

Hint:

As we have 20640 observations, this will probably take more than a minute to compute. Feel free to restrict the data to a subset of e.g. 5000 observations.

# TODO: Implement LOOCV
...
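
A minimal sketch of the LOOCV mechanics follows, again on a synthetic stand-in dataset (an assumption, chosen so the code runs quickly and offline) rather than on the housing data. With `LeaveOneOut`, each fold holds out a single observation, so every "MSE" value is simply the squared prediction error for that one observation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small stand-in dataset (assumption): replace with a subset of the housing data
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# One fold per observation
loo = LeaveOneOut()
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=loo, scoring="neg_mean_squared_error"
)
mse = -neg_mse  # one squared error per held-out observation

print("Mean MSE:  ", mse.mean())
print("Median MSE:", np.median(mse))
print("Min / max: ", mse.min(), mse.max())
```

Squared errors are typically right-skewed: a few poorly predicted observations can pull the mean well above the median, which is worth keeping in mind when discussing the range and the choice of summary statistic. For the histogram, `plt.hist(mse, bins=50, range=(0, 6))` restricts the x-range as requested; the bin count here is an arbitrary choice.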