Lab 03: Multiple linear regression

Candy competition

Important

This lab is due on Thursday, February 12 at 11:59pm.

To be considered on time, the following must be done by the due date:

Final .qmd and .pdf files pushed to your GitHub repo
Final .pdf file submitted on Gradescope

Introduction

In today’s lab you will analyze data collected from an online experiment conducted by the now-defunct data journalism website FiveThirtyEight to describe what makes the best candy.

Learning goals

By the end of the lab you will be able to…

transform and create new variables
fit and interpret multiple linear regression models
compare models to select the “best” one
collaborate using GitHub

Getting started

Click here to find your team for this week’s lab.
Go to the sta210-sp26 organization on GitHub. Click on the repo with the prefix lab-03-. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo, starting a new project in R, and configuring git.
Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.
Do not make any changes to the .qmd file until the instructions tell you to do so.

Workflow: Using Git and GitHub as a team

Important

Assign each person on your team a number 1 through 4. For teams of three, Team Member 1 can take on the role of Team Member 4.

The following exercises must be done in order. Only one person should type in the .qmd file, commit, and push updates at a time. When it is not your turn to type, you should still share ideas and contribute to the team’s discussion.

⌨️ Team Member 1: Hands on the keyboard.

🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!¹

Change the author to your team name and include each team member’s name in the author field of the YAML in the following format: Team Name: Member 1, Member 2, Member 3, Member 4.

Team Member 1: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd and PDF documents, and finally push the changes to GitHub.

Team Members 2, 3, 4: Once Team Member 1 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the updated name in your .qmd file.

Packages

The following packages are used in the lab.

library(tidyverse)
library(tidymodels)
library(knitr)
library(fivethirtyeight)

Data: Candy competition

The data for this lab come from the FiveThirtyEight article The Ultimate Halloween Candy Power Ranking by Walt Hickey. To collect the data, Hickey and collaborators at FiveThirtyEight set up an experiment in which people voted on a series of randomly generated candy matchups (e.g., Reese’s vs. Skittles). Click here to check out some of the matchups.

The data set contains the characteristics and win percentage for 85 candies in the experiment. The variables are:

| Variable | Description |
|---|---|
| `chocolate` | Does it contain chocolate? |
| `fruity` | Is it fruit flavored? |
| `caramel` | Is there caramel in the candy? |
| `peanutyalmondy` | Does it contain peanuts, peanut butter or almonds? |
| `nougat` | Does it contain nougat? |
| `crispedricewafer` | Does it contain crisped rice, wafers, or a cookie component? |
| `hard` | Is it a hard candy? |
| `bar` | Is it a candy bar? |
| `pluribus` | Is it one of many candies in a bag or box? |
| `sugarpercent` | The percentile of sugar it falls under within the data set. Values 0 - 1. |
| `pricepercent` | The unit price percentile compared to the rest of the set. Values 0 - 1. |
| `winpercent` | The overall win percentage according to 269,000 matchups. Values 0 - 100. |

Use the code below to get a glimpse of the candy_rankings data frame in the fivethirtyeight R package.

glimpse(candy_rankings)

Rows: 85
Columns: 13
$ competitorname   <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
$ chocolate        <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ fruity           <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
$ caramel          <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ peanutyalmondy   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, …
$ nougat           <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ crispedricewafer <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hard             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bar              <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ pluribus         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
$ sugarpercent     <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent     <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…

Exercises

The goal of this analysis is to use multiple linear regression to understand the factors that make a good candy, as measured by winpercent.

Team Member 1: Type the team’s responses to exercises 1 - 2.

Exercise 1

Visualize the relationship between the response variable winpercent and one potential quantitative predictor. Write an observation from the graph.
Visualize the relationship between the response variable and one potential categorical predictor. Write an observation from the graph.

Tip

Include informative axis labels and titles on the visualizations.
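As a rough illustration of the tip, the sketch below labels a scatterplot and a set of side-by-side boxplots. It uses the built-in `mtcars` data purely as a stand-in (an assumption for illustration, not the lab data), and the object names `scatter_plot` and `box_plot` are hypothetical.

```r
library(ggplot2)

# mtcars is a built-in R data set, used here only as a stand-in for the lab data

# Quantitative predictor: scatterplot with informative labels
scatter_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    title = "Fuel efficiency vs. weight"
  )

# Categorical predictor: side-by-side boxplots with informative labels
box_plot <- ggplot(mtcars, aes(x = factor(am), y = mpg)) +
  geom_boxplot() +
  labs(
    x = "Transmission (0 = automatic, 1 = manual)",
    y = "Miles per gallon",
    title = "Fuel efficiency by transmission type"
  )

scatter_plot
box_plot
```

The same pattern should carry over to the lab data by swapping in the relevant data frame and variables.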

Exercise 2

We will do some feature engineering² to transform and create new variables to consider for the model.

Create a categorical variable that breaks sugarpercent into quartiles:
- “Q1” if sugarpercent \(<\) the \(25^{th}\) percentile.
- “Q2” if \(25^{th}\) percentile \(\leq\) sugarpercent \(<\) \(50^{th}\) percentile.
- “Q3” if \(50^{th}\) percentile \(\leq\) sugarpercent \(<\) \(75^{th}\) percentile.
- “Q4” if sugarpercent \(\geq\) \(75^{th}\) percentile.
Multiply pricepercent by 100, so the variable ranges from 0 - 100 instead of 0 - 1.
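One possible way to set up transformations like these is sketched below on a small made-up data frame. The data frame `toy` and its columns `score`, `price`, `score_cat`, and `price_pct` are hypothetical names chosen for illustration; they are not part of the lab data.

```r
library(dplyr)

# Hypothetical toy data: `score` and `price` stand in for 0-1 percentile variables
toy <- tibble(
  score = c(0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90, 0.95),
  price = c(0.10, 0.25, 0.40, 0.55, 0.60, 0.75, 0.85, 0.99)
)

# 25th, 50th, and 75th percentile cutoffs computed from the data
cuts <- unname(quantile(toy$score, probs = c(0.25, 0.50, 0.75)))

toy <- toy |>
  mutate(
    # case_when() checks conditions in order, so each row gets the first match
    score_cat = case_when(
      score < cuts[1] ~ "Q1",
      score < cuts[2] ~ "Q2",
      score < cuts[3] ~ "Q3",
      TRUE ~ "Q4"
    ),
    # rescale from 0-1 to 0-100
    price_pct = price * 100
  )

toy
```

Because the conditions are evaluated in order, the lower bound of each interval does not need to be written out explicitly.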

Important

You will use these variables whenever sugarpercent and pricepercent are referenced in the remainder of the lab.

Team Member 1: Render, commit and push your changes to GitHub with an informative commit message. Make sure to commit and push all changed files so that your Git pane is clear afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the responses to exercises 1 - 2.

Team Member 2: It’s your turn! Type the team’s response to exercises 3 - 4.

Exercise 3

Fit a model using sugarpercent, pricepercent, chocolate, peanutyalmondy, and the interaction between chocolate and peanutyalmondy to predict winpercent. Neatly display the model using 3 decimal places. Remember to use the variables created in the previous exercise.
Interpret the intercept in the context of the data.
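For reference, here is a minimal sketch of fitting a model with an interaction term and displaying it to 3 decimal places. It uses the built-in `mtcars` data as a stand-in (an assumption for illustration), with `vs` and `am` recoded as logical so the term names resemble those produced by TRUE/FALSE predictors. The names `cars` and `cars_fit` are hypothetical.

```r
library(tidymodels)
library(knitr)

# Stand-in data: recode two 0/1 columns of mtcars as logical (TRUE/FALSE)
cars <- mtcars |>
  mutate(vs = vs == 1, am = am == 1)

# `vs * am` expands to vs + am + vs:am (both main effects plus the interaction)
cars_fit <- linear_reg() |>
  fit(mpg ~ wt + vs * am, data = cars)

# Display the coefficient table rounded to 3 decimal places
tidy(cars_fit) |>
  kable(digits = 3)
```

In a model like this, the intercept is the expected response when every quantitative predictor equals 0 and every logical predictor is FALSE; with a multi-level categorical predictor, it also corresponds to that predictor's baseline level.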

Exercise 4

Use the model from the previous exercise to interpret the following in the context of the data:

Coefficient of sugarpercent = Q3.
Coefficient of chocolateTRUE
Coefficient of peanutyalmondyTRUE:chocolateTRUE

Team Member 2: Render, commit and push your changes to GitHub with an informative commit message. Make sure to commit and push all changed files so that your Git pane is clear afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the responses to exercises 3 - 4.

Team Member 3: It’s your turn! Type the team’s response to exercises 5 - 6.

Exercise 5

Let’s consider another model. Fit a model that includes chocolate, pricepercent, crispedricewafer, peanutyalmondy, and sugarpercent, along with the interaction between pricepercent and peanutyalmondy. Neatly display the model using 3 decimal places. Remember to use the variables created in Exercise 2.
Show how the values for statistic and p.value are computed for the coefficient of pricepercent.
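As a sketch of the general calculation, the test statistic for a coefficient is the estimate divided by its standard error, and the p-value is the two-sided tail area under a t distribution with n - p - 1 degrees of freedom (n observations, p predictor terms). The example below checks this on a stand-in model fit to the built-in `mtcars` data; the model and object names are assumptions for illustration, not the lab model.

```r
# Stand-in model on built-in data (not the lab model)
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(fit)$coefficients

# Pull the estimate and standard error for one coefficient
estimate  <- coefs["wt", "Estimate"]
std_error <- coefs["wt", "Std. Error"]

# Test statistic for H0: coefficient = 0
statistic <- estimate / std_error

# Residual degrees of freedom: n - p - 1 = 32 - 2 - 1 = 29
df_resid <- df.residual(fit)

# Two-sided p-value: area in both tails beyond |statistic|
p_value <- 2 * pt(abs(statistic), df = df_resid, lower.tail = FALSE)

c(statistic = statistic, p_value = p_value)
```

The hand-computed values should agree with the `t value` and `Pr(>|t|)` columns of the model summary.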

Exercise 6

(a) Compute the Root Mean Square Error (RMSE) for the models fit in Exercise 3 and Exercise 5. The Root Mean Square Error is

\[ RMSE = \sqrt{\frac{\sum_{i=1}^n e_i^2}{n}} \]

(b) Which model would you choose based on the results from part (a)? Briefly explain.
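The RMSE formula above can be computed directly from a model's residuals. The sketch below does so for two stand-in models fit to the built-in `mtcars` data; the helper `rmse_from_model()` and the model names are hypothetical, introduced only for illustration.

```r
# Two stand-in models on built-in data (not the lab models)
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: RMSE = sqrt(sum of squared residuals / n)
rmse_from_model <- function(model) {
  e <- residuals(model)
  sqrt(sum(e^2) / length(e))
}

rmse_small <- rmse_from_model(fit_small)
rmse_large <- rmse_from_model(fit_large)

c(small = rmse_small, large = rmse_large)
```

A smaller RMSE indicates that the fitted values are, on average, closer to the observed responses. Note that in this stand-in example the larger model contains the smaller one, so its RMSE on the same data cannot be higher; that guarantee does not apply when neither model is nested in the other.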

Team Member 3: Render, commit and push your changes to GitHub with an informative commit message. Make sure to commit and push all changed files so that your Git pane is clear afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the responses to exercises 5 - 6.

Team Member 4: It’s your turn! Type the team’s response to exercise 7.

Exercise 7

Use the model you selected to describe what generally makes a good candy, as measured by the win percentage. Include discussion about the interpretation of the model coefficients and the statistical significance of the predictors based on the model output.

Team Member 4: Render, commit and push your changes to GitHub with an informative commit message. Make sure to commit and push all changed files so that your Git pane is clear afterwards and the rest of the team can see the completed lab.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the team’s completed lab!

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Reminder: you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit the assignment:

Access Gradescope through the STA 210 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Select the name of every team member.
Mark the pages for the “Exercises” question.
Select the first page of your PDF submission to be associated with the “Workflow & formatting” question.

Grading (10 points)

| Component | Points |
|---|---|
| Completion | 8 |
| Workflow & formatting | 2 |

7 pts: Complete all exercises
6 pts: Complete 6 exercises
5 pts: Complete 5 exercises
4 pts: Complete 4 exercises
3 pts: Complete 3 exercises
0 pts: Complete < 2 exercises

Workflow & formatting

The “Workflow & formatting” grade is to assess the reproducible workflow and collaboration. This includes having at least one meaningful commit from each team member and updating the team name and date in the YAML.

2 pts: Meet all criteria
1 pt: Meet some criteria
0 pt: Meet no criteria

Footnotes

“Feature engineering entails reformatting predictor values to make them easier for a model to use effectively. This includes transformations and encodings of the data to best represent their important characteristics.” -from Tidy Modeling with R ↩︎