library(tidyverse)
library(tidymodels)
library(knitr)AE 06: Inference for multiple linear regression
Palmer penguins
Go to the course GitHub organization and locate your ae-06 repo to get started.
Data: Palmer penguins
The penguins data set contains data for penguins found on three islands in the Palmer Archipelago, Antarctica. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. These data can be found in the palmerpenguins R package.
penguins <- palmerpenguins::penguins |>
select(body_mass_g, flipper_length_mm, bill_length_mm, species) |>
drop_na()For this analysis, we will drop observations that have missing values for any of the variables of interest. In practice, we want to be careful about how this might bias our conclusions.
Main effects model
Let’s fit the main effects model using flipper_length_mm, species, and bill_length_mm.
penguins_fit <- lm(body_mass_g ~ flipper_length_mm +
species + bill_length_mm,
data = penguins)
tidy(penguins_fit, conf.int = TRUE) |>
kable(digits = 3)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -3904.387 | 529.257 | -7.377 | 0.000 | -4945.450 | -2863.324 |
| flipper_length_mm | 27.429 | 3.176 | 8.638 | 0.000 | 21.182 | 33.675 |
| speciesChinstrap | -748.562 | 81.534 | -9.181 | 0.000 | -908.943 | -588.182 |
| speciesGentoo | 90.435 | 88.647 | 1.020 | 0.308 | -83.937 | 264.807 |
| bill_length_mm | 61.736 | 7.126 | 8.664 | 0.000 | 47.720 | 75.753 |
Bootstrap CI
Construct the bootstrap CI
Fill in the code to construct a bootstrap distribution for speciesGentoo. Remove #| eval: false after filling in the code.
n = 100
set.seed(210)
boot_dist <- penguins |>
specify(______) |>
generate(reps = n, type = "bootstrap") |>
fit()Confidence interval for speciesGentoo
Use the bootstrap distribution to compute the standard error for speciesGentoo. How does it compare to the standard error in the model output? Is this what you expected? Why or why not?
# add codeConstruct the 95% confidence interval for speciesGentoo. How does it compare to the 95% CI in the model output? Is this what you expect? Why or why not?
# add codePermutation test
Similar to simple linear regression, we can use permutation sampling for simulation-based hypothesis tests for coefficients in multiple linear regression.
Now that we have multiple predictors, there are many ways to permute the data. We will discuss two here:
Permute the response variable.
Permute the predictor of interest.
We will use both methods to conduct a hypothesis test for the coefficient of speciesGentoo.
Permute response variable
Fill in the code to construct permutation samples by permuting the response variable for the variable in Main effects model. We will also compute the p-value for the coefficient. Remove #| eval: false after you have filled in the code.
n = 100
set.seed(210)
# construct null distribution
null_dist <- _____ |>
specify(______) |>
hypothesize(null = "independence") |>
generate(reps = _____, type = "permute") |>
fit()
# observed fit
observed_fit <- penguins |>
specify(body_mass_g ~ flipper_length_mm +
species + bill_length_mm) |>
fit()
# compute the p-value
get_p_value(
____,
obs_stat = ____,
direction = "two-sided"
)What is your conclusion about the coefficient of speciesGentoo based on the p-value?
Permute single predictor
Now let’s construct the permutation sample by permuting only the variable species. We do so by adding the variables argument to the generate function. Remove #| eval: false after you have filled in the code.
n = 100
set.seed(210)
# construct null distribution
null_dist_2 <- _____ |>
specify(______) |>
hypothesize(null = "independence") |>
generate(reps = _____, type = "permute",
variables = c(species)) |>
fit()
# observed fit
observed_fit <- penguins |>
specify(body_mass_g ~ flipper_length_mm +
species + bill_length_mm) |>
fit()
# compute the p-value
get_p_value(
____,
obs_stat = ____,
direction = "two-sided"
)What is your conclusion about the coefficient of speciesGentoo based on the p-value?
Different conclusions
Why did we get different conclusions? Hint: Think about what structure / variable relationships are preserved or broken in each approach.
Wrapping up
Once you’ve completed the AE:
Render the document to produce the PDF with all of your work from today’s class.
Push all your work to your AE repo on GitHub. You’re done! 🎉