AE 06: Inference for multiple linear regression

Palmer penguins

Published

February 5, 2026

Important

Go to the course GitHub organization and locate your ae-06 repo to get started.

library(tidyverse)
library(tidymodels)
library(knitr)

Data: Palmer penguins

The penguins data set contains data for penguins found on three islands in the Palmer Archipelago, Antarctica. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. These data can be found in the palmerpenguins R package.

penguins <- palmerpenguins::penguins |>
  select(body_mass_g, flipper_length_mm, bill_length_mm, species) |>
  drop_na()
Caution

For this analysis, we will drop observations that have missing values for any of the variables of interest. In practice, we want to be careful about how this might bias our conclusions.

Main effects model

Let’s fit the main effects model using flipper_length_mm, species, and bill_length_mm.

penguins_fit <- lm(body_mass_g ~ flipper_length_mm +
                     species + bill_length_mm, 
                   data = penguins)

tidy(penguins_fit, conf.int = TRUE) |>
  kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -3904.387 529.257 -7.377 0.000 -4945.450 -2863.324
flipper_length_mm 27.429 3.176 8.638 0.000 21.182 33.675
speciesChinstrap -748.562 81.534 -9.181 0.000 -908.943 -588.182
speciesGentoo 90.435 88.647 1.020 0.308 -83.937 264.807
bill_length_mm 61.736 7.126 8.664 0.000 47.720 75.753

Bootstrap CI

Construct the bootstrap CI

Fill in the code to construct a bootstrap distribution for speciesGentoo. Remove #| eval: false after filling in the code.

n = 100
set.seed(210)

boot_dist <- penguins |>
  specify(______) |>
  generate(reps = n, type = "bootstrap") |>
  fit()

Confidence interval for speciesGentoo

Use the bootstrap distribution to compute the standard error for speciesGentoo. How does it compare to the standard error in the model output? Is this what you expected? Why or why not?

# add code

Construct the 95% confidence interval for speciesGentoo. How does it compare to the 95% CI in the model output? Is this what you expect? Why or why not?

# add code

Permutation test

Similar to simple linear regression, we can use permutation sampling for simulation-based hypothesis tests for coefficients in multiple linear regression.

Now that we have multiple predictors, there are many ways to permute the data. We will discuss two here:

  • Permute the response variable.

  • Permute the predictor of interest.

We will use both methods to conduct a hypothesis test for the coefficient of speciesGentoo.

Permute response variable

Fill in the code to construct permutation samples by permuting the response variable for the variable in Main effects model. We will also compute the p-value for the coefficient. Remove #| eval: false after you have filled in the code.

n = 100
set.seed(210)

# construct null distribution 
null_dist <- _____ |>
  specify(______) |>
  hypothesize(null = "independence") |>
  generate(reps = _____, type = "permute") |>
  fit()

# observed fit
observed_fit <- penguins |>
  specify(body_mass_g ~ flipper_length_mm +
                     species + bill_length_mm) |>
  fit()

# compute the p-value
get_p_value(
  ____,
  obs_stat = ____,
  direction = "two-sided"
)

What is your conclusion about the coefficient of speciesGentoo based on the p-value?

Permute single predictor

Now let’s construct the permutation sample by permuting only the variable species. We do so by adding the variables argument to the generate function. Remove #| eval: false after you have filled in the code.

n = 100
set.seed(210)

# construct null distribution 
null_dist_2 <- _____ |>
  specify(______) |>
  hypothesize(null = "independence") |>
  generate(reps = _____, type = "permute", 
           variables = c(species)) |>
  fit()

# observed fit
observed_fit <- penguins |>
  specify(body_mass_g ~ flipper_length_mm +
                     species + bill_length_mm) |>
  fit()

# compute the p-value
get_p_value(
  ____,
  obs_stat = ____,
  direction = "two-sided"
)

What is your conclusion about the coefficient of speciesGentoo based on the p-value?

Different conclusions

Why did we get different conclusions? Hint: Think about what structure / variable relationships are preserved or broken in each approach.

Wrapping up

Important

Once you’ve completed the AE:

  • Render the document to produce the PDF with all of your work from today’s class.

  • Push all your work to your AE repo on GitHub. You’re done! 🎉