AE 11: Model selection for logistic regression
Go to the course GitHub organization and locate your ae-11 repo to get started.
Packages
library(tidyverse)
library(tidymodels)
library(knitr)
leukemia <- read_csv("data/leukemia.csv") |>
  mutate(Resp = factor(Resp))
Response to Leukemia treatment
Today’s data is from a study of 51 untreated adult patients with acute myeloid leukemia. Each patient was given a course of treatment and then assessed for their response to it. [1]
The goal of today’s analysis is to use pre-treatment factors to predict how likely a patient is to respond to the treatment.
We will use the following variables:
- Age: Age at diagnosis (in years)
- Smear: Differential percentage of blasts
- Infil: Percentage of absolute marrow leukemia infiltrate
- Index: Percentage labeling index of the bone marrow leukemia cells
- Blasts: Absolute number of blasts, in thousands
- Temp: Highest temperature of the patient prior to treatment, in degrees Fahrenheit
- Resp: 1 = responded to treatment or 0 = failed to respond
Exercise 1
Begin by splitting the data into training (80%) and testing (20%) sets. Use 210 for the seed.
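As a rough illustration of an 80/20 split, here is a minimal sketch on simulated data. The `toy_` objects and their simulated values are hypothetical stand-ins, not the study data; with the real data you would pass `leukemia` to `initial_split()` instead.

```r
library(tidymodels)

# Hypothetical stand-in data: 51 simulated patients, NOT the study values
set.seed(1)
toy_leukemia <- tibble(
  Age  = round(runif(51, 20, 80)),
  Temp = round(runif(51, 98, 104), 1),
  Resp = factor(rbinom(51, 1, plogis((50 - Age) / 15)))
)

# Set the seed right before the random split so the split is reproducible
set.seed(210)
toy_split <- initial_split(toy_leukemia, prop = 0.8)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Check how many rows landed in each set
nrow(toy_train)
nrow(toy_test)
```

Every row ends up in exactly one of the two sets, so the two counts add up to the number of rows in the original data.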
Exercise 2
We will begin by considering the model with all pre-treatment variables: Age, Smear, Infil, Index, Blasts and Temp. Fit a model using these six variables to predict whether a patient responded to the treatment (Resp).
Which variables are statistically significant using a threshold of \(\alpha = 0.05\)?
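One possible way to fit a logistic regression and flag the significant terms is sketched below on simulated data. The `toy_` objects, the two-predictor formula, and the `significant` column are assumptions made for illustration; the exercise itself uses all six pre-treatment variables from the training set.

```r
library(tidymodels)

# Hypothetical stand-in data: 51 simulated patients, NOT the study values
set.seed(1)
toy_leukemia <- tibble(
  Age  = round(runif(51, 20, 80)),
  Temp = round(runif(51, 98, 104), 1),
  Resp = factor(rbinom(51, 1, plogis((50 - Age) / 15)))
)

set.seed(210)
toy_train <- training(initial_split(toy_leukemia, prop = 0.8))

# Fit a logistic regression on the training set (default glm engine);
# coefficients are on the log-odds scale for Resp = 1
toy_fit <- logistic_reg() |>
  fit(Resp ~ Age + Temp, data = toy_train)

# One row per term; flag terms with p-value below 0.05
toy_coefs <- tidy(toy_fit) |>
  mutate(significant = p.value < 0.05)

toy_coefs
```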
Exercise 3
We want to compare the model from Exercise 2 that includes all pre-treatment predictors to a model that only includes the statistically significant predictors. We will conduct cross-validation, using AUC on the assessment sets as the criterion for selecting a final model.
Why might we prefer cross-validation with AUC over AIC/BIC on the training data for this analysis?
Exercise 4
Complete the code below to conduct 4-fold cross validation on the training data for the model in Exercise 2 with all pre-treatment predictors. Use 210 for the seed.
Remove #| eval: false when you have filled in the code.
# define the folds (set the seed first so the folds are reproducible)
set.seed(210)
folds <- vfold_cv(____, v = 4)
# specify workflow for the model
______ <- workflow() |>
  add_model(logistic_reg()) |>
  add_formula(__________)
# CV for model
cross_validation <- _______ |>
  fit_resamples(resamples = folds)
# compute performance statistics for the model
collect_metrics(cross_validation, summarize = TRUE)
Exercise 5
Conduct cross validation for the model that only includes the predictors from Exercise 2 that are statistically significant at \(\alpha = 0.05\). Use 210 for the seed.
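A minimal sketch of cross-validating a smaller model and reporting only AUC follows. It runs on simulated data, and the `toy_` objects and the single-predictor formula `Resp ~ Age` are hypothetical; substitute the training data and whichever predictors were significant in Exercise 2.

```r
library(tidymodels)

# Hypothetical stand-in data: 51 simulated patients, NOT the study values
set.seed(1)
toy_leukemia <- tibble(
  Age  = round(runif(51, 20, 80)),
  Temp = round(runif(51, 98, 104), 1),
  Resp = factor(rbinom(51, 1, plogis((50 - Age) / 15)))
)

set.seed(210)
toy_train <- training(initial_split(toy_leukemia, prop = 0.8))

# Define 4 folds on the training data
set.seed(210)
toy_folds <- vfold_cv(toy_train, v = 4)

# Workflow for the reduced model (illustrative formula)
toy_wflow <- workflow() |>
  add_model(logistic_reg()) |>
  add_formula(Resp ~ Age)

# Fit on each analysis set and compute AUC on each assessment set
toy_cv <- toy_wflow |>
  fit_resamples(resamples = toy_folds, metrics = metric_set(roc_auc))

# Average AUC across the 4 folds
toy_cv_auc <- collect_metrics(toy_cv)
toy_cv_auc
```

Passing `metrics = metric_set(roc_auc)` is a design choice that restricts the output to the one metric used for model selection here; leaving it out returns the default classification metrics.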
Exercise 6
Which model do you select based on the results from cross validation? Evaluate how well the selected model performs on the testing data.
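To illustrate evaluating a chosen model on held-out data, the sketch below refits a model on the training set and computes AUC on the testing set. The simulated data, the `toy_` objects, and the formula `Resp ~ Age` are assumptions for illustration, not the selected model.

```r
library(tidymodels)

# Hypothetical stand-in data: 51 simulated patients, NOT the study values
set.seed(1)
toy_leukemia <- tibble(
  Age  = round(runif(51, 20, 80)),
  Temp = round(runif(51, 98, 104), 1),
  Resp = factor(rbinom(51, 1, plogis((50 - Age) / 15)))
)

set.seed(210)
toy_split <- initial_split(toy_leukemia, prop = 0.8)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Refit the chosen model on the full training set
toy_final <- logistic_reg() |>
  fit(Resp ~ Age, data = toy_train)

# Add predicted classes and probabilities to the testing data
toy_pred <- augment(toy_final, new_data = toy_test)

# AUC on the testing data; Resp = 1 is the second factor level,
# so treat the second level as the event
toy_auc <- roc_auc(toy_pred, truth = Resp, .pred_1, event_level = "second")
toy_auc
```

The testing set is used only once, after model selection, so the resulting AUC is not influenced by the choice between models.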
Wrapping up
Render the document to produce the PDF with all of your work from today’s class.
Push all your work to your AE repo on GitHub. You’re done! 🎉
Footnotes
[1] The data set is from the Stat2Data R package. This AE is adapted from exercises in Stat 2.