10.3 Exercises

Exercise 1: The Dataset

Remember the diabetes dataset that we used a few weeks ago to predict the presence of diabetes? This week, we will explore its six blood serum measures (s1 to s6) and look for latent factors underlying them.

The data is already loaded and processed. Please do the following:

  1. Print and read through the description of the dataset using the .DESCR attribute.

  2. Inspect the DataFrame and check that it contains the expected variables (s1 to s6).

  3. Plot the correlation matrix of the blood serum measures (e.g. with sns.heatmap() or plt.imshow()).

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the six blood serum measures (s1-s6) of the diabetes dataset
diabetes_data = datasets.load_diabetes(as_frame=True)
df = diabetes_data.data[['s1', 's2', 's3', 's4', 's5', 's6']]

# TODO: Exercise 1
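One possible way to approach the three steps is sketched below. It repeats the loading step so that it runs on its own, and it uses plt.imshow() for the plot. The figure size, colour map and axis labelling are illustrative choices, not requirements of the exercise.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

# Same loading step as in the exercise cell
diabetes_data = datasets.load_diabetes(as_frame=True)
df = diabetes_data.data[["s1", "s2", "s3", "s4", "s5", "s6"]]

# 1. Dataset description
print(diabetes_data.DESCR)

# 2. Quick look at the variables and their summary statistics
print(df.head())
print(df.describe())

# 3. Pearson correlation matrix, shown as a colour-coded image
corr = df.corr()
fig, ax = plt.subplots(figsize=(6, 5))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label="Pearson correlation")
ax.set_title("Correlation matrix of the blood serum measures")
plt.show()
```

Fixing the colour scale to the range -1 to 1 keeps zero correlation at the neutral midpoint of the colour map, which makes blocks of strongly related variables easier to spot.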

Exercise 2: Fitting the Model

Use the factor_analyzer package to fit a first, unrotated model (rotation=None) with the minres method (method='minres'), using as many factors as there are variables. Then determine the most suitable number of factors with the Kaiser criterion, i.e. retain only the factors whose eigenvalue is greater than 1.

from factor_analyzer import FactorAnalyzer

# TODO: Exercise 2
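To make the Kaiser criterion concrete, the following sketch computes the eigenvalues of the correlation matrix directly with NumPy and counts how many exceed 1. It is an illustration of the rule rather than the intended solution. After fitting a model, FactorAnalyzer.get_eigenvalues() should return the same values as the first of its two arrays.

```python
import numpy as np
from sklearn import datasets

# Same loading step as in Exercise 1
diabetes_data = datasets.load_diabetes(as_frame=True)
df = diabetes_data.data[["s1", "s2", "s3", "s4", "s5", "s6"]]

# Eigenvalues of the (symmetric) correlation matrix, largest first
corr = df.corr().to_numpy()
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Kaiser criterion: keep factors with an eigenvalue greater than 1
n_factors = int(np.sum(eigenvalues > 1))

print("Eigenvalues:", np.round(eigenvalues, 3))
print("Factors retained by the Kaiser criterion:", n_factors)
```

The eigenvalues of a correlation matrix sum to the number of variables, so an eigenvalue above 1 means the factor accounts for more variance than a single standardized variable would.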

Exercise 3: Loadings & Communalities

After selecting the number of factors, fit a final model with oblimin rotation. Print the rotated factor loadings and communalities. Do the communalities suggest a good model fit?

# TODO: Exercise 3
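As a reminder of what a communality measures, the sketch below computes communalities by hand from a small loading matrix. The loadings and the variable names x1 to x4 are made up for illustration and are not results from the diabetes data. The computation assumes uncorrelated factors. With an oblique rotation such as oblimin, the factor correlations also enter the exact communality, so the sum of squared loadings is only an approximation there.

```python
import numpy as np

# Hypothetical loadings: 4 variables (rows) on 2 factors (columns)
variables = ["x1", "x2", "x3", "x4"]
loadings = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
    [0.2, 0.3],
])

# Communality: share of a variable's variance explained by all factors,
# computed as the row-wise sum of squared loadings
communalities = (loadings ** 2).sum(axis=1)

# Uniqueness: the remaining, unexplained share of variance
uniquenesses = 1 - communalities

for name, h2, u2 in zip(variables, communalities, uniquenesses):
    print(f"{name}: communality = {h2:.2f}, uniqueness = {u2:.2f}")
```

In this made-up example, x4 has a communality of only 0.13, so the two factors explain little of its variance. A variable like this would count as poorly represented by the factor structure.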

Voluntary exercise 1: Improving the Fit

Look again at the communalities. Some variables are poorly represented by the factor structure, i.e. they have low communalities. Exclude them and fit the model again. Did the fit improve?

# TODO: Voluntary exercise 1