Model selection
Mar 26, 2026
- Lab 06 due TODAY at 11:59pm
- HW 04 due March 31 at 11:59pm
- Statistics experience due April 15
- SSMU Data Mini #3 - April 4
Recall the four candidate models for Buchanan vs. Bush in HW 03:
| Model | Response variable | Predictor variable | RMSE |
|---|---|---|---|
| 1 | Buchanan2000 | Bush2000 | 348.6 |
| 2 | log(Buchanan2000) | Bush2000 | 0.758 |
| 3 | Buchanan2000 | log(Bush2000) | 369.533 |
| 4 | log(Buchanan2000) | log(Bush2000) | 0.460 |
Why is it unreliable to use these values of RMSE to choose the best fit model?
Comparing logistic regression models using:

- Drop-in-deviance test
- AIC
- BIC

Cross validation for logistic regression
This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.
high_risk: 1 = high risk of heart disease in the next 10 years (TenYearCHD = 1), 0 = not high risk
age: Age at exam time (in years)
education: 1 = Some High School, 2 = High School or GED, 3 = Some College or Vocational School, 4 = College
currentSmoker: 0 = nonsmoker, 1 = smoker
totChol: Total cholesterol (in mg/dL)
# A tibble: 4,086 × 6
age education TenYearCHD totChol currentSmoker high_risk
<dbl> <fct> <dbl> <dbl> <fct> <fct>
1 39 4 0 195 0 0
2 46 2 0 250 0 0
3 48 1 0 245 1 0
4 61 3 1 225 1 1
5 46 3 0 285 1 0
6 43 2 0 228 0 0
7 63 1 1 205 0 1
8 45 2 0 313 1 0
9 52 1 0 260 0 0
10 43 1 0 225 1 0
# ℹ 4,076 more rows
Using age, totChol, and currentSmoker
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -6.673 | 0.378 | -17.647 | 0.000 | -7.423 | -5.940 |
| age | 0.082 | 0.006 | 14.344 | 0.000 | 0.071 | 0.094 |
| totChol | 0.002 | 0.001 | 1.940 | 0.052 | 0.000 | 0.004 |
| currentSmoker1 | 0.443 | 0.094 | 4.733 | 0.000 | 0.260 | 0.627 |
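Since the model is fit on the log-odds scale, exponentiating a coefficient gives an odds ratio. A quick check using the estimates in the table above:

```r
# Exponentiate the currentSmoker1 coefficient to get an odds ratio:
# holding age and totChol constant, the odds of being high risk for
# smokers are about 1.56 times the odds for nonsmokers
or_smoker <- exp(0.443)
round(or_smoker, 2)

# A 95% confidence interval for the odds ratio, from the CI for the slope
round(exp(c(0.260, 0.627)), 2)
```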
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.697
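As a reminder of what `roc_auc` measures, here is a base-R sketch computing the AUC via the rank-sum (Mann-Whitney) identity. The data here are simulated for illustration, not the Framingham fit:

```r
set.seed(210)
# Simulated predicted probabilities and binary outcomes (illustrative)
n <- 200
x <- rnorm(n)
pi_hat <- plogis(-0.5 + 1.2 * x)        # predicted probabilities
y <- rbinom(n, size = 1, prob = pi_hat) # observed 0/1 outcomes

# AUC = P(predicted prob for a random 1 exceeds that of a random 0),
# computed with the rank-sum (Mann-Whitney) identity
auc <- function(probs, outcome) {
  r <- rank(probs)
  n1 <- sum(outcome == 1)
  n0 <- sum(outcome == 0)
  (sum(r[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc_value <- auc(pi_hat, y)
auc_value  # between 0.5 (random guessing) and 1 (perfect discrimination)
```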
There are two approaches for testing coefficients in logistic regression:

(Wald) hypothesis test: use to test a single coefficient \(\beta_j\)

Drop-in-deviance test: use to test one or more coefficients at a time
Reduced model (without education):

| term | estimate |
|---|---|
| (Intercept) | -6.673 |
| age | 0.082 |
| totChol | 0.002 |
| currentSmoker1 | 0.443 |
Full model (with education):

| term | estimate |
|---|---|
| (Intercept) | -6.456 |
| age | 0.080 |
| totChol | 0.002 |
| currentSmoker1 | 0.445 |
| education2 | -0.270 |
| education3 | -0.232 |
| education4 | -0.035 |
\[ \begin{aligned} \log L = \sum\limits_{i=1}^n[y_i \log(\hat{\pi}_i) + (1 - y_i)\log(1 - \hat{\pi}_i)] \end{aligned} \]
Measure of how well the model fits the data
Higher values of \(\log L\) are better
Deviance = \(-2 \log L\)
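For a binary (0/1) response, the saturated model's log-likelihood is 0, so a fitted model's deviance equals \(-2\log L\) exactly. A small check on simulated data (illustrative, not the Framingham data):

```r
set.seed(1)
# Simulated binary data
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.3 + 0.8 * x))
m <- glm(y ~ x, family = binomial)

# Log-likelihood computed from the formula on this slide
pi_hat <- fitted(m)
logL <- sum(y * log(pi_hat) + (1 - y) * log(1 - pi_hat))

# Deviance = -2 log L (for 0/1 responses)
c(-2 * logL, deviance(m))
```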
Suppose there are two nested models: a reduced model with predictors \(x_1, \dots, x_q\) and a full model with predictors \(x_1, \dots, x_q, x_{q+1}, \dots, x_p\).
We want to test the hypotheses
\[ \begin{aligned} H_0&: \beta_{q+1} = \dots = \beta_p = 0 \\ H_a&: \text{ at least one }\beta_j \text{ is not } 0 \end{aligned} \]
To do so, we will use the Drop-in-deviance test, also known as the Nested Likelihood Ratio test
Hypotheses:
\[ \begin{aligned} H_0&: \beta_{q+1} = \dots = \beta_p = 0 \\ H_a&: \text{ at least one }\beta_j \text{ is not } 0 \end{aligned} \]
Test Statistic: \[\begin{aligned}G &= \text{Deviance}_{reduced} - \text{Deviance}_{full} \\ &= (-2 \log L_{reduced}) - (-2 \log L_{full})\end{aligned}\]
P-value: \(P(\chi^2 > G)\), calculated using a \(\chi^2\) distribution with degrees of freedom equal to the number of parameters being tested
Should we add education to the model?

Compute the deviance for each model.

Deviance of the reduced model (without education):

[1] 3224.812

Deviance of the full model (with education):

[1] 3217.6

Calculate the p-value using pchisq(), with degrees of freedom equal to the number of new terms in the full model (here 3: education2, education3, education4).

The p-value is 0.065. What is your conclusion?
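The calculation, using the deviances shown above:

```r
# Drop-in-deviance test statistic from the two deviances above
dev_reduced <- 3224.812
dev_full    <- 3217.6
G <- dev_reduced - dev_full   # 7.212

# education adds 3 terms (education2, education3, education4),
# so use a chi-square distribution with 3 degrees of freedom
p_value <- pchisq(G, df = 3, lower.tail = FALSE)
round(p_value, 3)  # 0.065
```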
We can use the anova() function with test = "Chisq" to conduct the drop-in-deviance test.

We can also use the drop-in-deviance test to test a subset of predictors in linear regression.
The deviance for the linear regression model is the sum of squared residuals
\[ SSR = \sum_{i=1}^n(y_i - \hat{y}_i)^2 \]
The test statistic for the drop-in-deviance test is \[G = SSR_{\text{reduced}} - SSR_{\text{full}}\]
This is also called the Nested F Test
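The nested F test can be run with anova() on two fitted lm objects. A minimal sketch on simulated data (the variables here are illustrative, not from HW 03):

```r
set.seed(42)
# Simulated data: z has no real effect beyond x
n <- 100
x <- rnorm(n)
z <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

reduced <- lm(y ~ x)
full    <- lm(y ~ x + z)

# G = SSR_reduced - SSR_full; adding terms can only decrease SSR
c(SSR_reduced = sum(resid(reduced)^2), SSR_full = sum(resid(full)^2))

# anova() on two nested lm fits computes the nested F test
anova(reduced, full)
```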
Estimators of prediction error and relative quality of models:
Akaike’s Information Criterion (AIC): \[AIC = -2\log L + 2(p+1)\]
Bayesian Information Criterion (BIC): \[ BIC = -2\log L + \log(n)\times(p+1)\]
\[ \begin{aligned} AIC &= -2\log L + 2(p+1) \\ BIC &= -2\log L + \log(n)\times(p+1) \end{aligned} \]
where \(\log L\) is the log-likelihood evaluated at the maximum likelihood estimate, \((p+1)\) is the number of terms in the model, and \(n\) is the sample size
First term: decreases as \(p\) increases, since adding terms (as in a step-wise model selection approach) cannot decrease the maximized likelihood
Second term: the penalty, increases as \(p\) increases
If \(n \geq 8\), the penalty for BIC is larger than that of AIC, so BIC tends to favor more parsimonious models (i.e. models with fewer terms)
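A quick numeric check of this comparison:

```r
# Per-term penalty: AIC adds 2, BIC adds log(n)
# log(8) is about 2.08 > 2, so BIC's penalty exceeds AIC's whenever n >= 8
n <- c(8, 100, 4086)
cbind(n, aic_penalty = 2, bic_penalty = round(log(n), 2))
```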
We can use the glance() function to see the AIC and BIC for a fitted model. Let’s look at the AIC for the model that includes age, totChol, and currentSmoker.
Let’s compare the reduced and full models using AIC.
[1] 3232.812
[1] 3231.6
Based on AIC, which model would you choose?
Let’s compare the reduced and full models using BIC
[1] 3258.074
[1] 3275.807
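These AIC and BIC values can be reproduced from the deviances and the definitions above: the reduced model has 4 terms (intercept, age, totChol, currentSmoker1), the full model has 7 (adding education2, education3, education4), and n = 4086.

```r
# Reproduce the reported AIC and BIC from the deviances (-2 log L)
dev_reduced <- 3224.812   # 4 model terms
dev_full    <- 3217.6     # 7 model terms
n <- 4086

aic <- function(dev, k) dev + 2 * k
bic <- function(dev, k) dev + log(n) * k

# Matches the slide values (up to rounding of the reported deviances)
c(aic(dev_reduced, 4), aic(dev_full, 7))  # about 3232.812 and 3231.6
c(bic(dev_reduced, 4), bic(dev_full, 7))  # about 3258.07 and 3275.81
```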
Based on BIC, which model would you choose?
Cross validation for logistic regression