7.3 Exercises

Exercise 1: Loading the Data

For today’s exercise we will use the Breast Cancer Wisconsin (Diagnostic) dataset. It is used for predicting whether a breast tumor is malignant (cancerous) or benign (non-cancerous) and contains measurements derived from images of breast mass samples obtained through fine needle aspirates.

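The dataset consists of 569 samples with 30 features that measure various characteristics of cell nuclei, such as radius, texture, perimeter, and area. Each sample is labeled as either malignant or benign. In the version loaded below via ucimlrepo, the target is stored as the strings M (malignant) and B (benign) rather than as 1 and 0.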

  1. Please visit the documentation and familiarize yourself with the dataset.

  2. Take an initial look at the features (predictors) and targets (outcomes), for example with the .head() method. Note that y no longer has a .head() method once it has been converted to a NumPy array.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from ucimlrepo import fetch_ucirepo

# Fetch dataset 
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17) 
  
# Data (as pandas dataframes) 
X = breast_cancer_wisconsin_diagnostic.data.features 
y = breast_cancer_wisconsin_diagnostic.data.targets

# Convert y to a 1D array (this is the required input for the logistic regression model)
y = np.ravel(y)

# Print information 
print(breast_cancer_wisconsin_diagnostic.variables)
print(X.head())
print(y)
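Before fitting a model, it helps to check how the labels are encoded and how balanced the classes are. The following is a minimal sketch, assuming the labels arrive as strings such as "M" and "B"; the list toy_labels is invented for illustration and stands in for y.

```python
from collections import Counter

# Hypothetical toy labels in the "M"/"B" style; in the exercise, use y instead.
toy_labels = ["M", "B", "B", "M", "B", "B", "B", "M"]

# Count how many samples fall into each class.
class_counts = Counter(toy_labels)
print(class_counts)

# Share of each class: a baseline for judging the model's accuracy later.
n_total = len(toy_labels)
class_shares = {label: count / n_total for label, count in class_counts.items()}
print(class_shares)

# Optional: encode the labels as integers (assumption: 1 = malignant, 0 = benign).
encoded = [1 if label == "M" else 0 for label in toy_labels]
print(encoded)
```

The share of the larger class is the accuracy a model would reach by always predicting that class, so it is a useful reference point for Exercise 2.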

Exercise 2: Fitting the prediction model

  1. Fit a logistic regression model using all predictors to predict the diagnosis (malignant or benign).

  2. Get and print the accuracy of the model.

  3. Get and print the confusion matrix for the target variable.

  4. Review the classification report and interpret the results.

Hint: If you get a warning about convergence, try setting max_iter=10000 in the logistic regression class.
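Convergence warnings often appear when features are measured on very different scales. Besides raising max_iter, a common alternative is to standardize the features before fitting. The sketch below compares both options on synthetic stand-in data from make_classification (not the breast cancer dataset); the variable names and the scale factor are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the breast cancer dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Inflate a few features to mimic measurements in very different units.
X_demo[:, :3] *= 1000

# Option 1: give the solver more iterations, as the hint suggests.
model_more_iter = LogisticRegression(max_iter=10000)
model_more_iter.fit(X_demo, y_demo)

# Option 2: standardize the features first, then fit with default settings.
model_scaled = make_pipeline(StandardScaler(), LogisticRegression())
model_scaled.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used.
print("iterations (raw features, max_iter=10000):", model_more_iter.n_iter_)
print("iterations (scaled features, default max_iter):", model_scaled[-1].n_iter_)
print("training accuracy (scaled features):", model_scaled.score(X_demo, y_demo))
```

Either option is acceptable for this exercise. Keep in mind that scaling changes the interpretation of the coefficients, which then refer to standardized units.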

# TODO

Voluntary exercise

  1. Try to create a custom plot that visualizes the confusion matrix. It should contain:

    • The four squares of the matrix (color coded)

    • Labels of the actual values in the middle of each square

    • Labels for all squares

    • A colorbar

    • A title

  2. Use ConfusionMatrixDisplay() from scikit-learn to achieve the same goal (and see that sometimes it makes sense to not re-invent the wheel :))
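As one possible starting point for the custom plot, the sketch below draws a hard-coded toy 2x2 matrix with matplotlib. The values in toy_cm and the class names are invented for illustration; in the exercise, the matrix would come from confusion_matrix instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical toy confusion matrix: rows = actual, columns = predicted.
toy_cm = [[5, 1],
          [1, 3]]
class_names = ["benign", "malignant"]

fig, ax = plt.subplots(figsize=(5, 4))

# Color-coded squares plus a colorbar.
image = ax.imshow(toy_cm, cmap="Blues")
fig.colorbar(image, ax=ax)

# Axis ticks, axis labels, and title.
ax.set_xticks([0, 1])
ax.set_xticklabels(class_names)
ax.set_yticks([0, 1])
ax.set_yticklabels(class_names)
ax.set_xlabel("Predicted label")
ax.set_ylabel("Actual label")
ax.set_title("Confusion matrix (toy data)")

# Write each count into the middle of its square; use white text on dark cells.
largest = max(max(row) for row in toy_cm)
for row in range(2):
    for col in range(2):
        value = toy_cm[row][col]
        text_color = "white" if value > largest / 2 else "black"
        ax.text(col, row, str(value), ha="center", va="center", color=text_color)

fig.tight_layout()
```

In a notebook the figure is displayed automatically at the end of the cell; in a script, call plt.show() or fig.savefig(...) to see it.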

# TODO