10.3 Exercises
Exercise 1: The Dataset
Remember the diabetes dataset, which we used a few weeks ago to predict the presence of diabetes? This week, our goal is to explore the six blood serum measures and look for underlying factors within them.
The data is already loaded and processed. Please do the following:
1. Print and read through the description of the dataset using the .DESCR attribute.
2. Inspect the DataFrame and check that it contains the correct variables.
3. Plot the correlation matrix of the blood serum measures (e.g. with sns.heatmap() or plt.imshow()).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets
# Load the six blood serum measures (s1-s6) of the diabetes dataset
diabetes_data = datasets.load_diabetes(as_frame=True)
df = diabetes_data.data[['s1', 's2', 's3', 's4', 's5', 's6']]
# TODO: Exercise 1
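As a hint for the plotting task, here is a minimal sketch of a correlation heatmap on synthetic data rather than on the diabetes measures. The DataFrame toy, its columns x1 to x3, and the random values are made up for illustration; DataFrame.corr() and imshow() are the calls you would reuse on df.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data (hypothetical): x2 is built from x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.5 * rng.normal(size=200),
    "x3": rng.normal(size=200),
})

# Pearson correlation matrix of all columns
corr = toy.corr()

# Show the matrix as an image with the colour scale fixed to [-1, 1]
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson r")
```

Fixing vmin and vmax keeps the colour scale symmetric around zero, so positive and negative correlations of the same strength look equally intense.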
Exercise 2: Fitting the Model
Use the factor_analyzer package to fit a first, non-rotated model with the minres optimizer and as many factors as there are items. Then determine the most suitable number of factors using the Kaiser criterion, which retains only factors whose eigenvalue is greater than 1.
from factor_analyzer import FactorAnalyzer
# TODO: Exercise 2
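To illustrate the Kaiser criterion itself, the following sketch applies it to synthetic data using only NumPy. The data, the two latent factors f1 and f2, and all variable names are assumptions made for this example. In the exercise you would instead take the eigenvalues from the fitted model (factor_analyzer should provide them through get_eigenvalues(), but check the package documentation for the exact return values).

```python
import numpy as np

# Synthetic data (hypothetical): two latent factors drive six observed items
rng = np.random.default_rng(42)
n = 500
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
noise = rng.normal(scale=0.5, size=(n, 6))
items = np.column_stack([f1, f1, f1, f2, f2, f2]) + noise

# Eigenvalues of the item correlation matrix, largest first
corr = np.corrcoef(items, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: keep factors with an eigenvalue greater than 1
n_factors = int(np.sum(eigenvalues > 1))
print(eigenvalues.round(2))
print("Factors retained:", n_factors)
```

The eigenvalues of a correlation matrix sum to the number of items, so an eigenvalue above 1 means the factor explains more variance than a single standardized item would on its own.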
Exercise 3: Loadings & Communalities
After selecting the number of factors, fit a final model with oblimin rotation. Print the rotated factor loadings and the communalities. Do the communalities suggest a good model fit?
# TODO: Exercise 3
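When reading the communalities, it can help to see how they relate to the loadings. The sketch below uses a hypothetical loading matrix (not results from the diabetes data) and computes each item's communality as the sum of its squared loadings. This relation is exact only when the factors are uncorrelated; with an oblique rotation such as oblimin, treat it as an approximation.

```python
import numpy as np

# Hypothetical loading matrix: 4 items (rows) on 2 factors (columns)
loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.9],
    [0.2, 0.3],
])

# Communality = sum of squared loadings per item (uncorrelated factors assumed)
communalities = (loadings ** 2).sum(axis=1)

# Uniqueness = share of item variance the factors do not explain
uniquenesses = 1 - communalities

for i, (h2, u2) in enumerate(zip(communalities, uniquenesses), start=1):
    print(f"item {i}: communality = {h2:.2f}, uniqueness = {u2:.2f}")
```

In this made-up example, item 4 has a communality of only 0.13, so the two factors leave most of its variance unexplained. Items like this are the ones to look out for in your own output.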
Voluntary exercise 1: Improving the Fit
Look again at the communalities. Some variables are poorly represented by the factor structure. Exclude them and fit the model again. Did the fit improve?
# TODO: Voluntary exercise 1