Multiple linear regression

Types of predictors

Author

Prof. Maria Tackett

Published

February 03, 2026

Announcements

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(patchwork)
library(knitr)
library(kableExtra)
library(colorblindr)
library(palmerpenguins)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

Topics

  • Include indicator variables for categorical predictors in the regression model

  • Mean-center and standardize quantitative predictors

  • Use interaction terms in the model to capture differences in a predictor’s effect

Data: Palmer penguins

The penguins data set contains data for penguins found on three islands in the Palmer Archipelago, Antarctica. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. These data can be found in the palmerpenguins R package.

. . .

# A tibble: 342 × 4
   body_mass_g flipper_length_mm bill_length_mm species
         <int>             <int>          <dbl> <fct>  
 1        3750               181           39.1 Adelie 
 2        3800               186           39.5 Adelie 
 3        3250               195           40.3 Adelie 
 4        3450               193           36.7 Adelie 
 5        3650               190           39.3 Adelie 
 6        3625               181           38.9 Adelie 
 7        4675               195           39.2 Adelie 
 8        3475               193           34.1 Adelie 
 9        4250               190           42   Adelie 
10        3300               186           37.8 Adelie 
# ℹ 332 more rows

Variables

Predictors:

  • bill_length_mm: Bill length in millimeters
  • flipper_length_mm: Flipper length in millimeters
  • species: Adelie, Gentoo, or Chinstrap species

Response: body_mass_g: Body mass in grams


The goal of this analysis is to use the bill length, flipper length, and species to predict body mass.

Response: body_mass_g

min median max iqr
2700 4050 6300 1200

Predictors

Response vs. predictors

Multiple linear regression model

penguin_fit <- lm(body_mass_g ~ flipper_length_mm + species + 
                bill_length_mm, data = penguins)

tidy(penguin_fit) |>
  kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) -3904.387 529.257 -7.377 0.000
flipper_length_mm 27.429 3.176 8.638 0.000
speciesChinstrap -748.562 81.534 -9.181 0.000
speciesGentoo 90.435 88.647 1.020 0.308
bill_length_mm 61.736 7.126 8.664 0.000

Model equation

\[ \begin{align}\widehat{\text{body_mass_g}} = -3904.387 &+27.429 \times \text{flipper_length_mm}\\ & -748.562 \times \text{Chinstrap}\\ &+ 90.435 \times \text{Gentoo}\\ &+ 61.736 \times \text{bill_length_mm} \end{align} \]

Note

We will talk about why there are two terms in the model for species shortly!

Interpreting \(\hat{\beta}_j\)

  • The estimated coefficient \(\hat{\beta}_j\) is the expected change in the mean of \(Y\) when \(X_j\) increases by one unit, holding the values of all other predictor variables constant.


  • Example: The estimated coefficient for flipper_length_mm is 27.429. This means for each additional millimeter in a penguin’s flipper length, its body mass is expected to be greater by 27.429 grams, on average, holding species and bill length constant.

Prediction

What is the predicted body mass for a Gentoo penguin with a flipper length of 200 millimeters and bill length of 45 millimeters?


new_penguin <- tibble(
  flipper_length_mm  = 200, 
  species = "Gentoo", 
  bill_length_mm = 45
)

predict(penguin_fit, new_penguin)
       1 
4449.955 

Intervals for prediction

90% confidence interval for estimated mean body mass of Gentoo penguins with flipper length of 200 mm and bill length of 45 mm.

predict(penguin_fit, new_penguin, interval = "confidence", 
        level = 0.90) |> kable(digits = 3)
fit lwr upr
4449.955 4355.238 4544.671


90% prediction interval for estimated body mass of an individual Gentoo penguin with flipper length of 200 mm and bill length of 45 mm.

predict(penguin_fit, new_penguin, interval = "prediction", 
        level = 0.90) |> kable(digits = 3)
fit lwr upr
4449.955 3881.035 5018.875

Categorical predictors

Indicator variables

  • Suppose there is a categorical variable with \(k\) categories (levels)

  • We can make \(k\) indicator variables - one indicator for each category

  • An indicator variable (also called dummy variables) takes values 1 or 0

    • 1 if the observation belongs to that category
    • 0 if the observation does not belong to that category

Indicator variables for species

penguins <- penguins |>
  mutate(
    adelie = if_else(species == "Adelie", 1, 0),
    chinstrap = if_else(species == "Chinstrap", 1, 0),
    gentoo = if_else(species == "Gentoo", 1, 0)
  )

. . .

# A tibble: 3 × 4
  species   adelie chinstrap gentoo
  <fct>      <dbl>     <dbl>  <dbl>
1 Adelie         1         0      0
2 Gentoo         0         0      1
3 Chinstrap      0         1      0

Indicators in the model

  • We will use \(k-1\) of the indicator variables in the model.
  • The baseline is the category that doesn’t have a term in the model. This is also called the reference level.
  • The coefficients of the indicator variables in the model are interpreted as the expected change in the response compared to the baseline, holding all other variables constant.

. . .

penguins |>
  select(species, chinstrap, gentoo) |>
  slice(1, 152, 283) # select example observations
# A tibble: 3 × 3
  species   chinstrap gentoo
  <fct>         <dbl>  <dbl>
1 Adelie            0      0
2 Gentoo            0      1
3 Chinstrap         1      0

Using all \(k\) indicators in R model

Run the code below to fit a model using flipper_length_mm, bill_length_mm, and all the indicator variables for species to predict body_mass_g. What do you notice about the model output? Why did this happen?

Interpreting species

term estimate std.error statistic p.value conf.low conf.high
(Intercept) -3904.387 529.257 -7.377 0.000 -4945.450 -2863.324
flipper_length_mm 27.429 3.176 8.638 0.000 21.182 33.675
speciesChinstrap -748.562 81.534 -9.181 0.000 -908.943 -588.182
speciesGentoo 90.435 88.647 1.020 0.308 -83.937 264.807
bill_length_mm 61.736 7.126 8.664 0.000 47.720 75.753

. . .

  • The baseline category is Adelie.
  • Penguins from the Chinstrap species are expected to have a body mass that is 748.562 grams less, on average, compared to penguins from the Adelie species, holding flipper length and bill length constant.

. . .

Interpret the coefficient of Gentoo in the context of the data.

Transforming quantitative predictors

Centering

  • Centering a quantitative predictor means shifting every value by some constant \(C\)

\[ X_{cent} = X - C \]

  • One common type of centering is mean-centering, in which every value of a predictor is shifted by its mean

  • Only quantitative predictors are centered

  • Center all quantitative predictors in the model for ease of interpretation, when using mean-centering

What is one reason we might want to center the quantitative predictors? What are the units of centered variables?

Centering

Use the scale() function with center = TRUE and scale = FALSE to mean-center variables

penguins <- penguins |>
  mutate(flipper_length_cent = scale(flipper_length_mm, center = TRUE, scale = FALSE), 
         bill_length_cent = scale(bill_length_mm, center = TRUE, scale = FALSE))

Original vs. mean-centered variable

Original variable


Mean SD
200.915 14.062

Mean-centered variable


Mean SD
0 14.062


. . .

Using mean-centered variables in the model

How do you expect the model to change if we use flipper_length_cent and bill_length_cent in the model?

. . .

term estimate std.error statistic p.value conf.low conf.high
(Intercept) 4318.066 45.674 94.542 0.000 4228.225 4407.908
flipper_length_cent 27.429 3.176 8.638 0.000 21.182 33.675
speciesChinstrap -748.562 81.534 -9.181 0.000 -908.943 -588.182
speciesGentoo 90.435 88.647 1.020 0.308 -83.937 264.807
bill_length_cent 61.736 7.126 8.664 0.000 47.720 75.753

Original vs. mean-centered model

term estimate
(Intercept) -3904.387
flipper_length_mm 27.429
speciesChinstrap -748.562
speciesGentoo 90.435
bill_length_mm 61.736
term estimate
(Intercept) 4318.066
flipper_length_cent 27.429
speciesChinstrap -748.562
speciesGentoo 90.435
bill_length_cent 61.736

What changed? What is the same?

Interpret coefficient of mean-centered predictors

Interpret the coefficient of bill_length_cent in the context of the data.

🔗 https://forms.office.com/r/Cn8pp2r32U

Standardizing

  • Standardizing a quantitative predictor mean shifting every value by the mean and dividing by the standard deviation of that variable

\[ X_{std} = \frac{X - \bar{X}}{s_X} \]

  • Only quantitative predictors are standardized

  • Standardize all quantitative predictors in the model for ease of interpretation

What is one reason we might want to standardize the quantitative predictors? What are the units of standardized variables?

Standardizing

Use the scale() function with center = TRUE and scale = TRUE to standardized variables


penguins <- penguins |>
  mutate(flipper_length_std = scale(flipper_length_mm, center = TRUE, scale = TRUE), 
         bill_length_std = scale(bill_length_mm, center = TRUE, scale = TRUE))

Original vs. standardized variable

Original variable


Mean SD
200.915 14.062

Standardized variable


Mean SD
0 1


. . .

Using standardized variables in the model

How do you expect the model to change if we use flipper_length_std and bill_length_std in the model?

. . .

term estimate std.error statistic p.value conf.low conf.high
(Intercept) 4318.066 45.674 94.542 0.000 4228.225 4407.908
flipper_length_std 385.696 44.654 8.638 0.000 297.862 473.531
speciesChinstrap -748.562 81.534 -9.181 0.000 -908.943 -588.182
speciesGentoo 90.435 88.647 1.020 0.308 -83.937 264.807
bill_length_std 337.055 38.902 8.664 0.000 260.533 413.577

Original vs. standardized model

term estimate
(Intercept) -3904.387
flipper_length_mm 27.429
speciesChinstrap -748.562
speciesGentoo 90.435
bill_length_mm 61.736
term estimate
(Intercept) 4318.066
flipper_length_std 385.696
speciesChinstrap -748.562
speciesGentoo 90.435
bill_length_std 337.055

What changed? What is the same?

Interaction terms

Interaction terms

  • Sometimes the relationship between a predictor variable and the response depends on the value of another predictor variable.
  • This is an interaction effect.
  • To account for this, we can include interaction terms in the model.

Bill length versus species

If the lines are not parallel, there is indication of a potential interaction effect, i.e., the slope of bill length may differ based on the species.

Interaction terms in R

Fit the model predicting body_mass_g using flipper_length_mm, bill_length_mm, speices, and the interaction between species and bill_length_mm.

  • Write the estimated regression equation for Adelie penguins.

  • Write the estimated regression equation for Chinstrap penguins.

Interaction term in model

penguin_int_fit <- lm(body_mass_g ~ flipper_length_mm + species + bill_length_mm + species * bill_length_mm,
      data = penguins)
term estimate std.error statistic p.value
(Intercept) -4297.905 645.054 -6.663 0.000
flipper_length_mm 27.263 3.175 8.586 0.000
speciesChinstrap 1146.287 726.217 1.578 0.115
speciesGentoo 54.716 619.934 0.088 0.930
bill_length_mm 72.692 10.642 6.831 0.000
speciesChinstrap:bill_length_mm -41.035 16.104 -2.548 0.011
speciesGentoo:bill_length_mm -1.163 14.436 -0.081 0.936

Interpreting interaction terms

  • What the interaction means: The effect of bill length on the body mass is 41.035 less when the penguin is from the Chinstrap species compared to the effect for the Adelie species, holding all else constant.
  • Interpreting bill_length_mm for Chinstrap: For each additional millimeter in bill length, we expect the body mass of Chinstrap penguins to increase by 31.657 grams (72.692 - 41.035), holding flipper length constant.

Summary


In general, how do

indicators for categorical predictors impact the model equation?

interaction terms impact the model equation?

Recap

  • Included indicator variables for categorical predictors in the regression model

  • Mean-centered and standardized quantitative predictors

  • Used interaction terms in the model to capture differences in a predictor’s effect

Next class