SLR: Mathematical models for inference cont’d

Prof. Maria Tackett

January 29, 2026

Announcements

Topics

  • Use mathematical models to

    • conduct a hypothesis test for the slope

    • construct confidence intervals for the slope

    • construct intervals for predictions

Computational setup

# load packages
library(tidyverse)   # for data wrangling and visualization
library(tidymodels)  # for modeling
library(openintro)   # for the duke_forest dataset
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(kableExtra)  # also for pretty tables
library(patchwork)   # arrange plots

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

From last class

Mathematical representation of the model

\[ \begin{aligned} Y &= \text{Model} + \text{Error} \\[8pt] &= f(X) + \epsilon \\[8pt] &= E(Y|X) + \epsilon \\[8pt] &= \beta_0 + \beta_1 X + \epsilon \end{aligned} \]

where the errors are independent and normally distributed:

  • independent: Knowing the error term for one observation doesn’t tell you anything about the error term for another observation
  • normally distributed: \(\epsilon \sim N(0, \sigma_\epsilon^2)\)
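To make the model concrete, the sketch below simulates data from \(Y = \beta_0 + \beta_1 X + \epsilon\) and then fits a line to it. The parameter values, sample size, and seed are made up for illustration; they are not estimates from the Duke Forest data.

```r
# Simulate from Y = beta_0 + beta_1 * X + epsilon (all values hypothetical)
set.seed(210)
n       <- 500
beta_0  <- 100000   # hypothetical intercept
beta_1  <- 150      # hypothetical slope
sigma_e <- 50000    # hypothetical error standard deviation

x       <- runif(n, min = 1000, max = 5000)   # predictor values
epsilon <- rnorm(n, mean = 0, sd = sigma_e)   # independent N(0, sigma_e^2) errors
y       <- beta_0 + beta_1 * x + epsilon      # response

sim_fit <- lm(y ~ x)
coef(sim_fit)   # estimates should land near beta_0 and beta_1
```

Because the errors are drawn independently from one normal distribution, the simulated data satisfy both conditions listed above by construction.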

Mathematical representation, visualized

\[ Y|X \sim N(\beta_0 + \beta_1 X, \sigma_\epsilon^2) \]

Image source: Introduction to the Practice of Statistics (5th ed)
  • Mean: \(\beta_0 + \beta_1 X\), the predicted value based on the regression model
  • Variance: \(\sigma_\epsilon^2\), constant across the range of \(X\)
    • How do we estimate \(\sigma_\epsilon^2\)?

Regression standard error

Once we fit the model, we can use the residuals to estimate the regression standard error, the average distance between the observed values and the regression line

\[ \hat{\sigma}_\epsilon = \sqrt{\frac{\sum\limits_{i=1}^n(y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum\limits_{i=1}^n e_i^2}{n-2}} \]
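As a quick check of the formula, the sketch below computes \(\hat{\sigma}_\epsilon\) from the residuals and compares it with the value R reports. It uses R's built-in `cars` data as a stand-in (an assumption made so the code runs without extra packages); the same steps should apply to a model fit on `duke_forest`.

```r
# Stand-in data: R's built-in `cars` (the slides use duke_forest instead)
fit <- lm(dist ~ speed, data = cars)

n <- nrow(cars)
e <- residuals(fit)                      # e_i = y_i - y_hat_i

sigma_hat <- sqrt(sum(e^2) / (n - 2))    # regression standard error

sigma_hat     # computed from the formula
sigma(fit)    # value R reports for the same model
```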

Standard error of \(\hat{\beta}_1\)

The standard error of \(\hat{\beta}_1\) quantifies the sampling variability in the estimated slopes

\[ SE_{\hat{\beta}_1} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{(n-1)s_X^2}} \]

term         estimate   std.error  statistic  p.value
(Intercept)  116652.33  53302.46   2.19       0.03
area         159.48     18.17      8.78       0.00
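The `std.error` column can be reproduced from the formula for \(SE_{\hat{\beta}_1}\). The sketch below does so on R's built-in `cars` data, used here as a stand-in (an assumption, not the data from the slides), and compares the result with the standard error in the model summary.

```r
# Stand-in data: R's built-in `cars` (the slides use duke_forest instead)
fit <- lm(dist ~ speed, data = cars)

n    <- nrow(cars)
s_x2 <- var(cars$speed)                  # sample variance of the predictor

# SE of the slope: sigma_hat * sqrt(1 / ((n - 1) * s_X^2))
se_beta1 <- sigma(fit) * sqrt(1 / ((n - 1) * s_x2))

se_beta1                                           # from the formula
summary(fit)$coefficients["speed", "Std. Error"]   # from the model summary
```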

Mathematical models for inference for \(\beta_1\)

Hypothesis test for \(\beta_1\)

term         estimate   std.error  statistic  p.value
(Intercept)  116652.33  53302.46   2.19       0.03
area         159.48     18.17      8.78       0.00

\[ H_0: \beta_1 = 0 \hspace{2mm} \text{ vs }\hspace{2mm} H_a: \beta_1 \neq 0 \]

\[ T = \frac{\hat{\beta}_1 - 0}{SE_{\hat{\beta}_1}} = \frac{159.48 - 0}{18.17} = 8.78 \]

2 * pt(q = 8.78, df = 96, lower.tail = FALSE)
[1] 6.19602e-14
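The same calculation can be written so the test statistic is computed rather than typed in. This sketch uses the rounded estimate and standard error from the table and the 96 degrees of freedom used above; the variable names are illustrative.

```r
# Values copied from the regression output (rounded to 2 decimals)
beta1_hat <- 159.48
se_beta1  <- 18.17
df        <- 96      # n - 2, as used in the pt() call above

t_stat  <- (beta1_hat - 0) / se_beta1
# abs() keeps the two-sided p-value correct when the statistic is negative
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

round(t_stat, 2)
p_value
```

The p-value may differ slightly from the one above because the statistic is not rounded to 8.78 before it is passed to `pt()`.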

  • The data provide convincing evidence that the population slope \(\beta_1\) is different from 0.
  • The data provide convincing evidence of a linear relationship between area and price of houses in Duke Forest.

Confidence intervals

Confidence interval for the slope

\[ \text{Estimate} \pm \text{ (critical value) } \times \text{SE} \]

\[ \hat{\beta}_1 \pm t^* \times SE_{\hat{\beta}_1} \]

where \(t^*\) is calculated from a \(t\) distribution with \(n-2\) degrees of freedom

Confidence interval: Critical value

# confidence level: 95%
qt(0.975, df = nrow(duke_forest) - 2)
[1] 1.984984
# confidence level: 90%
qt(0.95, df = nrow(duke_forest) - 2)
[1] 1.660881
# confidence level: 99%
qt(0.995, df = nrow(duke_forest) - 2)
[1] 2.628016

95% CI for the slope: Calculation

term         estimate   std.error  statistic  p.value
(Intercept)  116652.33  53302.46   2.19       0.03
area         159.48     18.17      8.78       0.00

\[\hat{\beta}_1 = 159.48 \hspace{15mm} t^* = 1.98 \hspace{15mm} SE_{\hat{\beta}_1} = 18.17\]

\[ 159.48 \pm 1.98 \times 18.17 = (123.50, 195.46) \]

95% CI for the slope: Computation

tidy(df_fit, conf.int = TRUE, conf.level = 0.95) |> 
  kable(digits = 2)
term         estimate   std.error  statistic  p.value  conf.low  conf.high
(Intercept)  116652.33  53302.46   2.19       0.03     10847.77  222456.88
area         159.48     18.17      8.78       0.00     123.41    195.55
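The interval from `tidy()` (123.41, 195.55) differs slightly from the hand calculation (123.50, 195.46) because the hand calculation rounds \(t^*\) to 1.98. The sketch below repeats the hand calculation with the unrounded critical value, using the rounded estimate and standard error from the table and 96 degrees of freedom.

```r
# Values copied from the regression output (rounded to 2 decimals)
beta1_hat <- 159.48
se_beta1  <- 18.17
df        <- 96      # n - 2

t_star <- qt(0.975, df = df)                     # unrounded critical value
ci     <- beta1_hat + c(-1, 1) * t_star * se_beta1

round(ci, 2)   # should agree with conf.low and conf.high for area
```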

Intervals for predictions

  • Suppose we want to answer the question “What is the predicted sale price of a Duke Forest house that is 2,800 square feet?”
  • We said reporting a single estimate for the slope is not wise, and we should report a plausible range instead
  • Similarly, reporting a single prediction for a new value is not wise, and we should report a plausible range instead

Two types of predictions

  1. Prediction for the mean: “What is the average predicted sale price of Duke Forest houses that are 2,800 square feet?”

  2. Prediction for an individual observation: “What is the predicted sale price of a Duke Forest house that is 2,800 square feet?”

  • Which would you expect to be more variable? The average prediction or the prediction for an individual observation?

  • Based on your answer, how would you expect the widths of plausible ranges for these two predictions to compare?

Uncertainty in predictions

Confidence interval for the mean outcome: \[\large{\hat{y} \pm t_{n-2}^* \times \color{purple}{\mathbf{SE}_{\hat{\boldsymbol{\mu}}}}}\]

Prediction interval for an individual observation: \[\large{\hat{y} \pm t_{n-2}^* \times \color{purple}{\mathbf{SE_{\hat{y}}}}}\]

Standard errors

Standard error of the mean outcome: \[SE_{\hat{\mu}} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]

Standard error of an individual outcome: \[SE_{\hat{y}} = \hat{\sigma}_\epsilon\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]
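Both standard errors can be computed directly from these formulas and checked against `predict()`. The sketch below uses R's built-in `cars` data and a new value `x_new = 15`; both are assumptions chosen for illustration, not values from the slides.

```r
# Stand-in data: R's built-in `cars` (the slides use duke_forest instead)
fit   <- lm(dist ~ speed, data = cars)
x_new <- 15                                # hypothetical new predictor value

n     <- nrow(cars)
x_bar <- mean(cars$speed)
s_xx  <- sum((cars$speed - x_bar)^2)

shared_term <- 1 / n + (x_new - x_bar)^2 / s_xx
se_mu <- sigma(fit) * sqrt(shared_term)        # SE of the mean outcome
se_y  <- sigma(fit) * sqrt(1 + shared_term)    # SE of an individual outcome

new_obs <- data.frame(speed = x_new)

# se.fit from predict() is the standard error of the mean outcome
se_fit <- unname(predict(fit, new_obs, se.fit = TRUE)$se.fit)

# The prediction interval half-width should equal t* times se_y
pred_int   <- predict(fit, new_obs, interval = "prediction", level = 0.95)
half_width <- unname(pred_int[1, "upr"] - pred_int[1, "fit"])

c(se_mu = se_mu, se_fit = se_fit)
c(t_times_se_y = qt(0.975, df = n - 2) * se_y, half_width = half_width)
```

The only difference between the two standard errors is the extra 1 under the square root, so `se_y` is always larger than `se_mu`.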

Confidence interval

The 95% confidence interval for the mean outcome:

new_house <- tibble(area = 2800)

predict(df_fit, new_house, interval = "confidence", level = 0.95) |>
  kable()
fit lwr upr
563205.5 529351 597060.1

We are 95% confident that the mean sale price of Duke Forest houses that are 2,800 square feet is between $529,351 and $597,060.

Prediction interval

The 95% prediction interval for an individual outcome:

new_house <- tibble(area = 2800)

predict(df_fit, new_house, interval = "prediction", level = 0.95) |>
  kable()
fit lwr upr
563205.5 226438.3 899972.7

We are 95% confident that the sale price of an individual Duke Forest house that is 2,800 square feet is between $226,438 and $899,973.

Comparing intervals
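One way to compare the two intervals is by their widths. The sketch below uses the interval endpoints reported in the `predict()` output for a 2,800 square foot house; the variable names are illustrative.

```r
# Interval endpoints copied from the predict() output for area = 2800
ci_mean <- c(lwr = 529351,   upr = 597060.1)   # confidence interval (mean outcome)
pi_ind  <- c(lwr = 226438.3, upr = 899972.7)   # prediction interval (individual)

width_ci <- unname(ci_mean["upr"] - ci_mean["lwr"])
width_pi <- unname(pi_ind["upr"] - pi_ind["lwr"])

c(confidence = width_ci, prediction = width_pi)
width_pi / width_ci   # how many times wider the prediction interval is
```

Both intervals are centered at the same predicted value, but the prediction interval is roughly ten times as wide because it also accounts for the variability of an individual house around the mean.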

Extrapolation

Using the model to predict for values outside the range of the original data is extrapolation.

Calculate the prediction interval for the sale price of a “tiny house” in Duke Forest that is 225 square feet.

Image: Black tiny house on wheels

No, thanks!
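A simple guard against extrapolation is to check whether the new value falls inside the range of the observed predictor before predicting. The helper `is_extrapolation()` and the areas below are hypothetical, written only to illustrate the check.

```r
# Hypothetical helper: flags new predictor values outside the observed range
is_extrapolation <- function(x_new, x_obs) {
  x_new < min(x_obs) | x_new > max(x_obs)
}

# Made-up areas (square feet) standing in for the observed data
observed_area <- c(1100, 1850, 2400, 2800, 3600, 4200)

is_extrapolation(c(225, 2800), observed_area)
```

With the course data, the same check could be run as `is_extrapolation(225, duke_forest$area)`, assuming `duke_forest` is loaded as in the computational setup.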

Recap

  • Defined mathematical models to conduct inference for the slope

  • Used mathematical models to

    • construct confidence intervals for the slope

    • conduct a hypothesis test for the slope

    • construct intervals for predictions