Model comparison

Author

Prof. Maria Tackett

Published

Mar 03, 2026

Announcements

Lab 05 due Thursday at 11:59pm
HW 03 due Tuesday, March 17 at 11:59pm
Statistics experience due April 15
Please provide mid-semester feedback by Friday: https://duke.qualtrics.com/jfe/form/SV_3lvkRQbz7PMuJVA

Topics

Root mean square error
ANOVA for multiple linear regression and sum of squares
Comparing models with \(Adj. R^2\)
Occam’s razor and parsimony
Cross validation

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(patchwork)
library(knitr)
library(kableExtra)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Introduction

Data: Restaurant tips

Which variables help us predict the amount customers tip at a restaurant? To answer this question, we will use data collected in 2011 by a student at St. Olaf who worked at a local restaurant.

# A tibble: 169 × 4
     Tip Party Meal   Age   
   <dbl> <dbl> <chr>  <chr> 
 1  2.99     1 Dinner Yadult
 2  2        1 Dinner Yadult
 3  5        1 Dinner SenCit
 4  4        3 Dinner Middle
 5 10.3      2 Dinner SenCit
 6  4.85     2 Dinner Middle
 7  5        4 Dinner Yadult
 8  4        3 Dinner Middle
 9  5        2 Dinner Middle
10  1.58     1 Dinner SenCit
# ℹ 159 more rows

Variables

Predictors:

Party: Number of people in the party
Meal: Time of day (Lunch, Dinner, Late Night)
Age: Age category of person paying the bill (Yadult, Middle, SenCit)
Payment: Payment type (Cash, Credit, Credit/CashTip)

Response: Tip: Amount of tip

Response: `Tip`

Predictors

Relevel categorical predictors

tips <- tips |>
  mutate(
    Meal = fct_relevel(Meal, "Lunch", "Dinner", "Late Night"),
    Age  = fct_relevel(Age, "Yadult", "Middle", "SenCit")
  )

Predictors, again

Response vs. predictors

Fit and summarize model

tip_fit <- lm(Tip ~ Party + Age, data = tips)

tidy(tip_fit) |>
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	-0.170	0.366	-0.465	0.643
Party	1.837	0.124	14.758	0.000
AgeMiddle	1.009	0.408	2.475	0.014
AgeSenCit	1.388	0.485	2.862	0.005

. . .

Is this a useful model? How well does this model perform?

RMSE

\[ RMSE = \sqrt{\frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{n}} = \sqrt{\frac{\sum_{i=1}^ne_i^2}{n}} \]

Ranges between 0 (perfect predictor) and infinity (terrible predictor)
Same units as the response variable
The value of RMSE is more useful for comparing across models than evaluating a single model

RMSE in R

Use the rmse() function from the yardstick package (part of tidymodels)

tip_aug <- augment(tip_fit)
rmse(tip_aug, truth = Tip, estimate = .fitted)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        2.02

Analysis of variance (ANOVA)

Analysis of Variance (ANOVA): Technique to partition variability in Y by the sources of variability

ANOVA

Main Idea: Decompose the total variation in the response into
- the variation that can be explained by the each of the variables in the model
- the variation that can’t be explained by the model (left in the residuals)
If the variation that can be explained by the variables in the model is greater than the variation in the residuals, this signals that the model might be “valuable” (at least one of the \(\beta\)’s not equal to 0)

Sum of Squares

\[ \begin{aligned} \color{#407E99}{SST} \hspace{5mm}&= &\color{#993399}{SSM} &\hspace{5mm} + &\color{#8BB174}{SSR} \\[10pt] \color{#407E99}{\sum_{i=1}^n(y_i - \bar{y})^2} \hspace{5mm}&= &\color{#993399}{\sum_{i = 1}^{n}(\hat{y}_i - \bar{y})^2} &\hspace{5mm}+ &\color{#8BB174}{\sum_{i = 1}^{n}(y_i - \hat{y}_i)^2} \end{aligned} \]

\(R^2\)

The coefficient of determination \(R^2\) is the proportion of variation in the response, \(Y\), that is explained by the regression model

. . .

\[ R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST} = 1 - \frac{686.44}{1913.11} = 0.641 \]

Model comparison

Two potential models

Let’s consider two models:

Model 1: Party, Age
Model 2: Party, Age, Payment

. . .

Which model is a better fit for the data?

Limitation of \(R^2\)

When we add a predictor to a model:

The residual sum of squares (SSR) can only decrease (or stay the same)
Therefore, \(R^2\) can only increase (or stay the same)

Why can’t we soley rely on \(R^2\)?

Why can’t we rely solely on \(R^2\) for model comparison?

🔗 https://forms.office.com/r/ZWQEhqkcRX

Limitation of \(R^2\)

If we choose models based only on \(R^2\):

We will always prefer models with more predictors
Even if those predictors add little real value

We need a measure that balances model fit and model complexity

Adjusted \(R^2\)

Adjusted \(R^2\) penalizes for unnecessary predictors

\[Adj. R^2 = 1 - \frac{SSR/(n-p-1)}{SST/(n-1)}\]

where

\(n\) is the number of observations used to fit the model
\(p\) is the number of terms (not including the intercept) in the model

Comparing models with \(Adj. R^2\)

tip_fit_1 <- lm(Tip ~ Party + Age , 
    data = tips)

glance(tip_fit_1) |> 
  select(r.squared, adj.r.squared)

# A tibble: 1 × 2
  r.squared adj.r.squared
      <dbl>         <dbl>
1     0.641         0.635

tip_fit_2 <- lm(Tip ~ Party + Age + Payment, 
      data = tips)

glance(tip_fit_2) |> 
  select(r.squared, adj.r.squared)

# A tibble: 1 × 2
  r.squared adj.r.squared
      <dbl>         <dbl>
1     0.644         0.633

Which model would we choose based on \(R^2\)?
Which model would we choose based on Adjusted \(R^2\)?