AE 11: Model selection for logistic regression

Published

March 26, 2026

Important

Go to the course GitHub organization and locate your ae-11 repo to get started.

Packages

library(tidyverse)
library(tidymodels)
library(knitr)

leukemia <- read_csv("data/leukemia.csv") |>
  mutate(Resp = factor(Resp))

Response to Leukemia treatment

Today’s data is from a study in which 51 untreated adult patients with acute myeloid leukemia were given a course of treatment and then assessed for their response to it.¹

The goal of today’s analysis is to use pre-treatment factors to predict how likely a patient is to respond to the treatment.

We will use the following variables:

Age: Age at diagnosis (in years)
Smear: Differential percentage of blasts
Infil: Percentage of absolute marrow leukemia infiltrate
Index: Percentage labeling index of the bone marrow leukemia cells
Blasts: Absolute number of blasts, in thousands
Temp: Highest temperature of the patient prior to treatment, in degrees Fahrenheit
Resp: 1 = responded to treatment or 0 = failed to respond

Exercise 1

Begin by splitting the data into training (80%) and testing (20%) sets. Use 210 for the seed.
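As a reminder of the general pattern, here is a minimal sketch of a seeded train/test split with rsample. It uses a simulated stand-in data set (`toy`, with made-up columns `x` and `y`), not the leukemia data, so the object names are illustrative assumptions only.

```r
library(tidymodels)

# Simulated stand-in data (NOT the leukemia data): 51 rows, one predictor, binary outcome
set.seed(1)
toy <- tibble(
  x = rnorm(51),
  y = factor(rbinom(51, size = 1, prob = 0.5))
)

# Set the seed immediately before the random split so the split is reproducible
set.seed(210)
toy_split <- initial_split(toy, prop = 0.8)

# Extract the two pieces of the split
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

nrow(toy_train)
nrow(toy_test)
```

Setting the seed in the same code chunk as `initial_split()` keeps the split identical each time the document is rendered.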

Exercise 2

We will begin by considering the model with all pre-treatment variables: Age, Smear, Infil, Index, Blasts and Temp. Fit a model using these six variables to predict whether a patient responded to the treatment (Resp).

Which variables are statistically significant using a threshold of \(\alpha = 0.05\)?
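One way to read off significance is to tidy the fitted model and compare each p-value to the threshold. The sketch below does this on simulated data (`toy`, `x1`, `x2`, and `y` are made-up names, not variables from the leukemia data), so treat it as an illustration of the workflow rather than the answer to this exercise.

```r
library(tidymodels)

# Simulated stand-in data: y depends on x1 but not on x2
set.seed(1)
toy <- tibble(
  x1 = rnorm(100),
  x2 = rnorm(100),
  y  = factor(rbinom(100, size = 1, prob = plogis(1.5 * x1)))
)

# Fit a logistic regression with the default "glm" engine
toy_fit <- logistic_reg() |>
  fit(y ~ x1 + x2, data = toy)

# One row per coefficient; flag terms with p-value below 0.05
toy_coefs <- tidy(toy_fit) |>
  mutate(significant = p.value < 0.05)

toy_coefs
```

With a factor response, the model describes the log-odds of the second factor level (here `"1"`).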

Exercise 3

We want to compare the model from Exercise 2 that includes all pre-treatment predictors to a model that only includes the statistically significant predictors. We will conduct cross validation using AUC on the assessment set as the criterion for selecting a final model.

Why might we prefer cross-validation with AUC over AIC/BIC on the training data for this analysis?

Exercise 4

Complete the code below to conduct 4-fold cross validation on the training data for the model in Exercise 2 with all pre-treatment predictors. Use 210 for the seed.

Important

Remove #| eval: false when you have filled in the code.

# define the folds
folds <- vfold_cv(____, v = 4)

# specify workflow for the model
______ <- workflow() |> 
  add_model(logistic_reg()) |> 
  add_formula(__________) 

# CV for model 
cross_validation <- _______ |> 
  fit_resamples(resamples = folds)

# compute performance statistics for the model
collect_metrics(cross_validation, summarize = TRUE)

Exercise 5

Conduct cross validation for the model that only includes the predictors from Exercise 2 that are statistically significant at \(\alpha = 0.05\). Use 210 for the seed.

Exercise 6

Which model do you select based on the results from cross validation? Evaluate how well the selected model performs on the testing data.
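A minimal sketch of computing AUC on held-out data is shown below. It again uses simulated data with made-up names (`toy`, `x`, `y`), so it illustrates the mechanics only. One detail worth noting: yardstick treats the first factor level as the event by default, so when the outcome is coded 0/1 and the event of interest is `1`, `event_level = "second"` is needed alongside `.pred_1`.

```r
library(tidymodels)

# Simulated stand-in data (NOT the leukemia data)
set.seed(1)
toy <- tibble(
  x = rnorm(200),
  y = factor(rbinom(200, size = 1, prob = plogis(2 * x)))
)

set.seed(210)
toy_split <- initial_split(toy, prop = 0.8)

# Fit on the training portion only
toy_fit <- logistic_reg() |>
  fit(y ~ x, data = training(toy_split))

# augment() appends .pred_class, .pred_0, and .pred_1 to the testing data
toy_test_pred <- augment(toy_fit, new_data = testing(toy_split))

# AUC on the testing data, treating level "1" (the second level) as the event
toy_auc <- roc_auc(toy_test_pred, truth = y, .pred_1, event_level = "second")

toy_auc
```

Because the testing data is not used to fit or select the model, this AUC estimates how the selected model would perform on new patients.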

Wrapping up

Important

Render the document to produce the PDF with all of your work from today’s class.

Push all your work to your AE repo on GitHub. You’re done! 🎉

Footnotes

The data set is from the Stat2Data R package. This AE is adapted from exercises in Stat 2.