# load packages
library(tidyverse) # for data wrangling and visualization
library(tidymodels) # for modeling
library(openintro) # for the duke_forest dataset
library(scales) # for pretty axis labels
library(knitr) # for pretty tables
library(kableExtra) # also for pretty tables
library(patchwork) # arrange plots
# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

SLR: Mathematical models for inference cont’d
Announcements
Lab 01 due TODAY at 11:59pm
Statistics experience due April 21
SSMU Mini DataFest - February 8
- See Ed Discussion for announcement
Topics
Use mathematical models to
conduct a hypothesis test for the slope
construct confidence intervals for the slope
construct intervals for predictions
Computational setup
From last class
Mathematical representation of the model
\[ \begin{aligned} Y &= \text{Model} + \text{Error} \\[8pt] &= f(X) + \epsilon \\[8pt] &= E(Y|X) + \epsilon \\[8pt] &= \beta_0 + \beta_1 X + \epsilon \end{aligned} \]
where the errors are independent and normally distributed:
- independent: Knowing the error term for one observation doesn’t tell you anything about the error term for another observation
- normally distributed: \(\epsilon \sim N(0, \sigma_\epsilon^2)\)
Mathematical representation, visualized
\[ Y|X \sim N(\beta_0 + \beta_1 X, \sigma_\epsilon^2) \]

- Mean: \(\beta_0 + \beta_1 X\), the predicted value based on the regression model
- Variance: \(\sigma_\epsilon^2\), constant across the range of \(X\)
- How do we estimate \(\sigma_\epsilon^2\)?
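The distributional statement \(Y|X \sim N(\beta_0 + \beta_1 X, \sigma_\epsilon^2)\) can be sanity-checked with a quick simulation (the parameter values below are made up for illustration, not taken from the Duke Forest data):

```r
# simulate from Y | X ~ N(beta_0 + beta_1 * X, sigma^2)
# beta_0, beta_1, sigma are made-up values for illustration
set.seed(1234)
beta_0 <- 10
beta_1 <- 2
sigma  <- 3

x <- runif(500, min = 0, max = 10)
y <- beta_0 + beta_1 * x + rnorm(500, mean = 0, sd = sigma)

coef(lm(y ~ x))  # estimates should land near (10, 2)
```

With 500 observations the fitted intercept and slope recover the true values closely, because the data really do satisfy the model's assumptions.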
Regression standard error
Once we fit the model, we can use the residuals to estimate the regression standard error: the average distance between the observed values and the regression line.
\[ \hat{\sigma}_\epsilon = \sqrt{\frac{\sum\limits_{i=1}^n(y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum\limits_{i=1}^n e_i^2}{n-2}} \]
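A minimal sketch of this estimate, computed directly from the residuals (the model is refit here with `lm()` so the chunk is self-contained):

```r
library(openintro)  # for the duke_forest data

# refit the model with lm() so the sketch is self-contained
fit <- lm(price ~ area, data = duke_forest)
e <- residuals(fit)                    # e_i = y_i - y-hat_i
n <- nrow(duke_forest)

sigma_hat <- sqrt(sum(e^2) / (n - 2))  # note the n - 2 in the denominator
sigma_hat
summary(fit)$sigma                     # lm's built-in estimate; should match
```

Dividing by \(n - 2\) rather than \(n\) accounts for the two estimated coefficients, \(\hat{\beta}_0\) and \(\hat{\beta}_1\).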
Standard error of \(\hat{\beta}_1\)
The standard error of \(\hat{\beta}_1\) quantifies the sampling variability in the estimated slopes
\[ SE_{\hat{\beta}_1} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{(n-1)s_X^2}} \]
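This formula can be verified by hand against the `lm()` output (a sketch; the model is refit here for self-containedness):

```r
library(openintro)  # for the duke_forest data

fit <- lm(price ~ area, data = duke_forest)
n <- nrow(duke_forest)
sigma_hat <- summary(fit)$sigma        # regression standard error
s_x <- sd(duke_forest$area)            # sample SD of the predictor

se_slope <- sigma_hat * sqrt(1 / ((n - 1) * s_x^2))
se_slope                               # should match std.error for area below
```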
. . .
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 |
| area | 159.48 | 18.17 | 8.78 | 0.00 |
Mathematical models for inference for \(\beta_1\)
Hypothesis test for \(\beta_1\)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 |
| area | 159.48 | 18.17 | 8.78 | 0.00 |
\[ H_0: \beta_1 = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_A: \beta_1 \neq 0 \]
Hypothesis test for \(\beta_1\)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 |
| area | 159.48 | 18.17 | 8.78 | 0.00 |
\[ T = \frac{\hat{\beta}_1 - 0}{SE_{\hat{\beta}_1}} = \frac{159.48 - 0}{18.17} = 8.78 \]
. . .
2 * pt(q = 8.78, df = 96, lower.tail = FALSE)
#> [1] 6.19602e-14
Hypothesis test for \(\beta_1\)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 |
| area | 159.48 | 18.17 | 8.78 | 0.00 |
- The data provide convincing evidence that the population slope \(\beta_1\) is different from 0.
- The data provide convincing evidence of a linear relationship between area and price of houses in Duke Forest.
Confidence intervals
Confidence interval for the slope
\[ \text{Estimate} \pm \text{ (critical value) } \times \text{SE} \]
. . .
\[ \hat{\beta}_1 \pm t^* \times SE_{\hat{\beta}_1} \]
where \(t^*\) is calculated from a \(t\) distribution with \(n-2\) degrees of freedom
Confidence interval: Critical value
# confidence level: 95%
qt(0.975, df = nrow(duke_forest) - 2)
#> [1] 1.984984

# confidence level: 90%
qt(0.95, df = nrow(duke_forest) - 2)
#> [1] 1.660881

# confidence level: 99%
qt(0.995, df = nrow(duke_forest) - 2)
#> [1] 2.628016

95% CI for the slope: Calculation
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 |
| area | 159.48 | 18.17 | 8.78 | 0.00 |
\[\hat{\beta}_1 = 159.48 \hspace{15mm} t^* = 1.98 \hspace{15mm} SE_{\hat{\beta}_1} = 18.17\]
. . .
\[ 159.48 \pm 1.98 \times 18.17 = (123.50, 195.46) \]
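The same arithmetic in R, using the estimate and standard error reported in the table and the exact (unrounded) critical value with df = 96:

```r
# 95% CI for the slope by hand, from the table's estimate and std.error
t_star <- qt(0.975, df = 96)          # ~1.98
159.48 + c(-1, 1) * t_star * 18.17    # (lower, upper)
```

With the unrounded \(t^*\) the endpoints come out slightly different from the hand calculation above (which rounds \(t^*\) to 1.98), and match the `conf.low` / `conf.high` values in the computational output that follows.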
95% CI for the slope: Computation
tidy(df_fit, conf.int = TRUE, conf.level = 0.95) |>
  kable(digits = 2)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 | 10847.77 | 222456.88 |
| area | 159.48 | 18.17 | 8.78 | 0.00 | 123.41 | 195.55 |
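Base R's `confint()` gives the same interval directly from a fitted `lm` object (a sketch; the model is refit with `lm()` here for self-containedness):

```r
library(openintro)  # for the duke_forest data

fit <- lm(price ~ area, data = duke_forest)
confint(fit, level = 0.95)  # same intervals as conf.low / conf.high above
```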
Intervals for predictions
Intervals for predictions
- Suppose we want to answer the question “What is the predicted sale price of a Duke Forest house that is 2,800 square feet?”
- We said reporting a single estimate for the slope is not wise, and we should report a plausible range instead
- Similarly, reporting a single prediction for a new value is not wise, and we should report a plausible range instead

Two types of predictions
Prediction for the mean: “What is the average predicted sale price of Duke Forest houses that are 2,800 square feet?”
Prediction for an individual observation: “What is the predicted sale price of a Duke Forest house that is 2,800 square feet?”
. . .
Which would you expect to be more variable? The average prediction or the prediction for an individual observation?
Based on your answer, how would you expect the widths of plausible ranges for these two predictions to compare?
Uncertainty in predictions
Confidence interval for the mean outcome: \[\large{\hat{y} \pm t_{n-2}^* \times \color{purple}{\mathbf{SE}_{\hat{\boldsymbol{\mu}}}}}\]
. . .
Prediction interval for an individual observation: \[\large{\hat{y} \pm t_{n-2}^* \times \color{purple}{\mathbf{SE_{\hat{y}}}}}\]
Standard errors
Standard error of the mean outcome: \[SE_{\hat{\mu}} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]
. . .
Standard error of an individual outcome: \[SE_{\hat{y}} = \hat{\sigma}_\epsilon\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]
Standard errors
Standard error of the mean outcome: \[SE_{\hat{\mu}} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]
Standard error of an individual outcome: \[SE_{\hat{y}} = \hat{\sigma}_\epsilon\sqrt{\mathbf{\color{purple}{\Large{1}}} + \frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]
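A sketch of both standard errors at \(x = 2800\), computed directly from the formulas above (the model is refit with `lm()` for self-containedness):

```r
library(openintro)  # for the duke_forest data

fit <- lm(price ~ area, data = duke_forest)
x <- duke_forest$area
n <- length(x)
sigma_hat <- summary(fit)$sigma
x0 <- 2800

# shared term: 1/n + (x0 - xbar)^2 / sum((x_i - xbar)^2)
shared <- 1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2)

se_mu   <- sigma_hat * sqrt(shared)      # SE of the mean outcome
se_yhat <- sigma_hat * sqrt(1 + shared)  # SE of an individual outcome

c(se_mu = se_mu, se_yhat = se_yhat)      # se_yhat is always larger
```

The extra 1 under the square root is why the prediction interval is always wider than the confidence interval: it carries the irreducible observation-level noise \(\sigma_\epsilon^2\) on top of the uncertainty in the estimated mean.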
Confidence interval
The 95% confidence interval for the mean outcome:
new_house <- tibble(area = 2800)
predict(df_fit, new_house, interval = "confidence", level = 0.95) |>
  kable()

| fit | lwr | upr |
|---|---|---|
| 563205.5 | 529351 | 597060.1 |
. . .
We are 95% confident that the mean sale price of Duke Forest houses that are 2,800 square feet is between $529,351 and $597,060.
Prediction interval
The 95% prediction interval for an individual outcome:
new_house <- tibble(area = 2800)
predict(df_fit, new_house, interval = "prediction", level = 0.95) |>
  kable()

| fit | lwr | upr |
|---|---|---|
| 563205.5 | 226438.3 | 899972.7 |
. . .
We are 95% confident that the sale price of an individual Duke Forest house that is 2,800 square feet will be between $226,438 and $899,973.
Comparing intervals

Extrapolation
Using the model to predict for values outside the range of the original data is extrapolation.
. . .
Calculate the prediction interval for the sale price of a “tiny house” in Duke Forest that is 225 square feet.

. . .
No, thanks!
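For completeness, a sketch of the mechanics (the computation runs fine; the extrapolation is what makes the result untrustworthy):

```r
library(openintro)  # for the duke_forest data

fit <- lm(price ~ area, data = duke_forest)
tiny_house <- data.frame(area = 225)

predict(fit, tiny_house, interval = "prediction", level = 0.95)
# 225 sq ft is far below the smallest house in the data, so the
# interval is extremely wide -- its lower bound is even negative,
# an impossible sale price
```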
Recap
Defined mathematical models to conduct inference for the slope
Used mathematical models to
construct a confidence interval for the slope
conduct a hypothesis test for the slope
construct intervals for predictions