Mar 05, 2026
Lab 05 due TODAY at 11:59pm
HW 03 due Tuesday, March 17 at 11:59pm
Statistics experience due April 15
Please provide mid-semester feedback by Friday: https://duke.qualtrics.com/jfe/form/SV_3lvkRQbz7PMuJVA
Exam 02 proposed date change
In-class: Thursday, April 9
Take-home: Thursday, April 9 - Saturday, April 11
Please email me by tomorrow if you have any concerns
Which variables help us predict the amount customers tip at a restaurant? To answer this question, we will use data collected in 2011 by a student at St. Olaf who worked at a local restaurant.
# A tibble: 169 × 4
     Tip Party Meal   Age
   <dbl> <dbl> <fct>  <fct>
 1  2.99     1 Dinner Yadult
 2  2        1 Dinner Yadult
 3  5        1 Dinner SenCit
 4  4        3 Dinner Middle
 5 10.3      2 Dinner SenCit
 6  4.85     2 Dinner Middle
 7  5        4 Dinner Yadult
 8  4        3 Dinner Middle
 9  5        2 Dinner Middle
10  1.58     1 Dinner SenCit
# ℹ 159 more rows
Predictors:
Party: Number of people in the party
Meal: Time of day (Lunch, Dinner, Late Night)
Age: Age category of person paying the bill (Yadult, Middle, SenCit)
Payment: Payment type (Cash, Credit, Credit/CashTip)
The training set (i.e., the data used to fit the model) does not have the capacity to be a good arbiter of performance.
It is not an independent piece of information; predicting the training set can only reflect what the model already knows.
Suppose you give a class a test, then give them the answers, then provide the same test. The student scores on the second test do not accurately reflect what they know about the subject; these scores would probably be higher than their results on the first test.
We can reserve some data for a testing set that can be used to evaluate the model performance.
Create training and testing sets using functions from the rsample R package (part of tidymodels)
Step 1: Create an initial split:
Step 2: Save training data
Step 3: Save testing data
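The three steps above can be sketched as follows. This is a minimal sketch, not the original slide code: `tips_sim` is a simulated stand-in for the tips data, and the seed, object names, and the default `prop = 3/4` are all assumptions.

```r
library(rsample)

# Hypothetical stand-in for the tips data: 169 simulated rows
set.seed(123)  # arbitrary seed, not taken from the slides
tips_sim <- data.frame(
  Party = sample(1:6, 169, replace = TRUE),
  Age   = factor(sample(c("Yadult", "Middle", "SenCit"), 169, replace = TRUE))
)
tips_sim$Tip <- 1 + 1.8 * tips_sim$Party + rnorm(169, sd = 2)

# Step 1: create an initial split (default prop = 3/4 goes to training)
tips_split <- initial_split(tips_sim)

# Step 2: save training data
tips_train <- training(tips_split)

# Step 3: save testing data
tips_test <- testing(tips_split)

# Row counts of the two sets; together they cover all 169 rows
c(train = nrow(tips_train), test = nrow(tips_test))
```

With 169 rows and the default proportion, the training set has 126 rows, which matches the training-set size quoted later in these slides.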
Resampling is only conducted on the training set; the test set is not involved. For each iteration of resampling, the data are partitioned into two subsamples: an analysis set, used to fit the model, and an assessment set, used to evaluate it.
Image source: Kuhn and Silge. Tidy modeling with R.
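More specifically, v-fold cross validation is a commonly used resampling technique: the training data are randomly split into v partitions of roughly equal size, and each partition takes one turn as the assessment set while the remaining v - 1 partitions form the analysis set.
Let's give an example where v = 3…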
Split data into training and test sets
Specify model
Create workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
Tip ~ Age + Party
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
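A workflow like the one printed above could be built roughly as follows. The object names `tips_spec` and `tips_wflow` are assumptions. The formula and engine are taken from the printed output.

```r
library(tidymodels)

# Specify model: linear regression fit with lm
tips_spec <- linear_reg() |>
  set_engine("lm")

# Create workflow: bundle the model specification with the formula
tips_wflow <- workflow() |>
  add_model(tips_spec) |>
  add_formula(Tip ~ Age + Party)

# Printing shows the preprocessor (formula) and the model specification
tips_wflow
```

No data are needed at this stage; the workflow only records what will be fit.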
Randomly split your training data into 3 partitions:
There are 126 observations in the training data. How many observations are in a single assessment set?
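One way to check the fold sizes is to create the folds and count rows. This is a sketch under assumptions: `train_sim` is a hypothetical 126-row placeholder for the training data, and the seed is arbitrary.

```r
library(rsample)

# Hypothetical placeholder with the same number of rows as the training data
set.seed(123)  # arbitrary seed, not taken from the slides
train_sim <- data.frame(row_id = 1:126)

# Randomly split the training data into 3 partitions
folds <- vfold_cv(train_sim, v = 3)

# Number of rows used to fit (analysis) and to evaluate (assessment) in each fold
analysis_n   <- sapply(folds$splits, function(s) nrow(analysis(s)))
assessment_n <- sapply(folds$splits, function(s) nrow(assessment(s)))

analysis_n    # 84 84 84
assessment_n  # 42 42 42
```

Each assessment set holds 126 / 3 = 42 observations, and the other 84 form the analysis set for that fold.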
# Resampling results
# 3-fold cross-validation
# A tibble: 3 × 4
  splits          id    .metrics         .notes
  <list>          <chr> <list>           <list>
1 <split [84/42]> Fold1 <tibble [2 × 4]> <tibble [0 × 4]>
2 <split [84/42]> Fold2 <tibble [2 × 4]> <tibble [0 × 4]>
3 <split [84/42]> Fold3 <tibble [2 × 4]> <tibble [0 × 4]>
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   2.14      3  0.195  pre0_mod0_post0
2 rsq     standard   0.623     3  0.0470 pre0_mod0_post0
Note: These are calculated using the assessment data
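Results of this shape come from fitting the workflow to each fold and collecting the metrics. The sketch below runs on simulated data, so its numbers will not match the slides; `tips_sim`, the seed, and the object names are assumptions.

```r
library(tidymodels)

# Hypothetical stand-in for the tips data: 169 simulated rows
set.seed(123)  # arbitrary seed, not taken from the slides
tips_sim <- tibble(
  Party = sample(1:6, 169, replace = TRUE),
  Age   = factor(sample(c("Yadult", "Middle", "SenCit"), 169, replace = TRUE)),
  Tip   = 1 + 1.8 * Party + rnorm(169, sd = 2)
)

# Training data only: resampling never touches the test set
tips_train <- training(initial_split(tips_sim))
folds <- vfold_cv(tips_train, v = 3)

tips_wflow <- workflow() |>
  add_model(linear_reg()) |>
  add_formula(Tip ~ Age + Party)

# Fit on each analysis set, compute metrics on each assessment set
tips_cv <- fit_resamples(tips_wflow, resamples = folds)

# Metrics averaged across the 3 folds
cv_summary <- collect_metrics(tips_cv)

# Metrics for each fold separately
cv_by_fold <- collect_metrics(tips_cv, summarize = FALSE)
```

For a regression model, the default metrics are RMSE and R-squared, which is why the summary has two rows.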
# A tibble: 6 × 5
  id    .metric .estimator .estimate .config
  <chr> <chr>   <chr>          <dbl> <chr>
1 Fold1 rmse    standard       1.92  pre0_mod0_post0
2 Fold1 rsq     standard       0.713 pre0_mod0_post0
3 Fold2 rmse    standard       2.52  pre0_mod0_post0
4 Fold2 rsq     standard       0.554 pre0_mod0_post0
5 Fold3 rmse    standard       1.97  pre0_mod0_post0
6 Fold3 rsq     standard       0.603 pre0_mod0_post0
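The summarized metrics are averages of the per-fold estimates, which can be checked by hand. The values below are copied from the per-fold table; the helper name `cv_std_err` is an assumption, and small differences from the reported standard error come from the per-fold values being rounded.

```r
# Per-fold estimates copied from the table above
rmse_folds <- c(Fold1 = 1.92,  Fold2 = 2.52,  Fold3 = 1.97)
rsq_folds  <- c(Fold1 = 0.713, Fold2 = 0.554, Fold3 = 0.603)

# Standard error of the mean across folds: sd / sqrt(number of folds)
cv_std_err <- function(x) sd(x) / sqrt(length(x))

# Cross-validated RMSE and R-squared are means over the folds
rmse_mean <- mean(rmse_folds)  # about 2.14
rsq_mean  <- mean(rsq_folds)   # about 0.623

rmse_se <- cv_std_err(rmse_folds)  # about 0.19
rsq_se  <- cv_std_err(rsq_folds)   # about 0.047
```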
To illustrate how CV works, we used v = 3.
This was useful for illustrative purposes, but in practice v is often 5 or 10; we generally prefer 10-fold cross-validation as a default.
Exploratory data analysis
Using training data…
Fit and evaluate candidate models using cross validation
Select the best fit model
Check model conditions and diagnostics
Repeat as needed until you've landed on a final model
Evaluate the final model performance using the test set
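The last step, evaluating the final model on the test set, might look like this with `last_fit()` from tidymodels. This is a sketch on simulated data; `tips_sim`, the seed, and the choice of `Tip ~ Age + Party` as the final model are assumptions for illustration.

```r
library(tidymodels)

# Hypothetical stand-in for the tips data: 169 simulated rows
set.seed(123)  # arbitrary seed, not taken from the slides
tips_sim <- tibble(
  Party = sample(1:6, 169, replace = TRUE),
  Age   = factor(sample(c("Yadult", "Middle", "SenCit"), 169, replace = TRUE)),
  Tip   = 1 + 1.8 * Party + rnorm(169, sd = 2)
)
tips_split <- initial_split(tips_sim)

# Workflow for the model selected via cross validation (assumed here)
final_wflow <- workflow() |>
  add_model(linear_reg()) |>
  add_formula(Tip ~ Age + Party)

# Fit once on the full training set, then evaluate once on the test set
final_fit <- last_fit(final_wflow, split = tips_split)

# Test-set RMSE and R-squared
test_metrics <- collect_metrics(final_fit)
```

The test set is used exactly once here, after all model selection is finished, so it stays an independent check of performance.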
Tip
See Section 10.5.2 for a more detailed workflow
Source: Introduction to Regression Analysis
Have a good spring break!