Lab 02: Inference for regression using mathematical models

Coffee ratings

Important

This lab is due on Thursday, February 5 at 11:59pm. To be considered on time, the following must be done by the due date:

  • Final .qmd and .pdf files pushed to your GitHub repo

  • Final .pdf file submitted on Gradescope

Introduction

In today’s lab, you will analyze data from over 1,000 different coffees to explore the relationship between a coffee’s aftertaste and its overall quality.

Learning goals

By the end of the lab you will…

  • use mathematical models to conduct inference for the slope.
  • assess conditions for linear regression.

Getting started

  • Go to the sta210-sp26 organization on GitHub. Click on the repo with the prefix lab-02-. It contains the starter documents you need to complete the lab.

  • Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo, starting a new project in R, and configuring git.

Packages

The following packages are used in the lab.

library(tidyverse)
library(tidymodels)
library(knitr)

Data: Coffee ratings

We will analyze data originally from the Coffee Quality Database. It was featured in the TidyTuesday weekly data visualization challenge in July 2020, and the data set for the lab was obtained from the #TidyTuesday GitHub repo. It includes information about the origin, producer, measures of various coffee characteristics, and the quality measure for over 1,000 coffees. The coffees can be reasonably treated as a random sample.

This lab will focus on the following variables:

  • aftertaste: Aftertaste rating, 0 (worst aftertaste) - 10 (best aftertaste)
  • total_cup_points: Rating of overall quality, 0 (worst quality) - 100 (best quality)

Click here for the definitions of all variables in the data set. Click here for more details about how these characteristics were measured.

coffee <- read_csv("data/coffee-ratings.csv")

Exercises


Important

Write all code and narrative in your Quarto file. Write all narrative in complete sentences. Make sure the teaching team can read all of your code in your PDF document. This means you will need to break up long lines of code. One way to help avoid long lines of code is to start a new line after every pipe (|>) and plus sign (+).
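
As a brief illustration of this formatting advice, the sketch below starts a new line after every pipe and plus sign. It uses R's built-in mtcars data purely as a stand-in (an assumption for illustration, not the lab data), so the names heavy_cars, wt, and mpg are not part of the lab.

```r
library(tidyverse)

# mtcars is a built-in data set, used here only as a stand-in
heavy_cars <- mtcars |>
  filter(wt > 3) |>
  select(mpg, wt)

# One layer per line keeps each line short in the rendered PDF
p <- ggplot(heavy_cars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )
p
```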

Goal: Use linear regression to understand how a coffee’s overall quality rating varies with its aftertaste rating.

Exercise 1

Visualize the relationship between the aftertaste rating and total cup points. Write two observations from the plot.

Exercise 2

Fit the linear model using the aftertaste rating to understand variability in the total cup points. Neatly display the model using three decimal places and include the 98% confidence interval for the model coefficients in the output.
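
One possible way to produce this kind of display is sketched below on a stand-in model, not the lab model. It assumes lm() for fitting, tidy() from broom (loaded with tidymodels) for the coefficient table, and kable() for the display. The data set mtcars and the names fit and fit_tbl are illustrative assumptions.

```r
library(broom)  # tidy(); also loaded by tidymodels
library(knitr)  # kable()

# Stand-in model on built-in data (not the lab data)
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficient table with 98% confidence intervals
fit_tbl <- tidy(fit, conf.int = TRUE, conf.level = 0.98)

# Display the table with three decimal places
kable(fit_tbl, digits = 3)
```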

Exercise 3

  1. Interpret the slope in the context of the data.

  2. Assume you are a coffee drinker using the model from Exercise 2. Would you drink a coffee represented by the intercept? Why or why not?

Exercise 4

You can obtain the predicted values and other observation-level statistics from the model using the augment() function. Create a data frame called coffee_aug by replacing the blank with the name of the fitted model.

coffee_aug <- augment(_____)

  1. Write code to “manually” compute the regression standard error, \(\hat{\sigma}_\epsilon\), using the residuals (stored in the column .resid).

  2. State the definition of the regression standard error in the context of the data.

Tip

You can check your answer to part (a) by using the code below to get \(\hat{\sigma}_\epsilon\).

glance(model_name)$sigma
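
The kind of check described in the tip can be sketched on a stand-in model. For simple linear regression, the regression standard error is the square root of the sum of squared residuals divided by n - 2. The example assumes base R only and uses the built-in mtcars data; the names fit, res, and sigma_manual are hypothetical.

```r
# Stand-in model on built-in data (not the lab data)
fit <- lm(mpg ~ wt, data = mtcars)

res <- residuals(fit)  # same values as the .resid column from augment()
n <- length(res)

# Regression standard error for simple linear regression
sigma_manual <- sqrt(sum(res^2) / (n - 2))

sigma_manual
sigma(fit)  # base R's stored value, for comparison
```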

Exercise 5

Do the data provide evidence of a statistically significant linear relationship between aftertaste rating and total cup points? Conduct a hypothesis test using mathematical models to answer this question.

  1. State the null and alternative hypotheses in words and in mathematical notation.

  2. What is the test statistic? State what the test statistic means in the context of this problem. Note: You do not need to show how the test statistic is computed.

  3. What distribution was used to calculate the p-value? Be specific.

  4. State the conclusion in the context of the data using a threshold of \(\alpha = 0.02\) to make your decision.
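
To make the mechanics of this test concrete, here is a minimal sketch on a stand-in model. It assumes the usual t-test for the slope, with null value 0 and n - 2 degrees of freedom, and uses the built-in mtcars data. The names est, se, t_stat, and p_value are illustrative.

```r
# Stand-in model on built-in data (not the lab data)
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- coef(summary(fit))

est <- coefs["wt", "Estimate"]
se <- coefs["wt", "Std. Error"]
df <- df.residual(fit)  # n - 2 for simple linear regression

# Number of standard errors the estimated slope is from the null value of 0
t_stat <- (est - 0) / se

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

t_stat
p_value
```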

Exercise 6

  1. What is the critical value used to calculate the confidence interval displayed in Exercise 2? Show the code and output used to get your response.

  2. Is the confidence interval consistent with the conclusions from the hypothesis test? Briefly explain why or why not.
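
A critical value of this kind can be obtained with qt(). The sketch below, on a stand-in model fit to the built-in mtcars data, assumes a 98% confidence level and rebuilds the slope interval by hand so it can be compared with confint(). The names conf_level, t_star, and ci_manual are hypothetical.

```r
# Stand-in model on built-in data (not the lab data)
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- coef(summary(fit))

conf_level <- 0.98
df <- df.residual(fit)

# 1% in each tail leaves 98% in the middle of the t distribution
t_star <- qt(1 - (1 - conf_level) / 2, df = df)

# estimate +/- critical value * standard error
ci_manual <- coefs["wt", "Estimate"] +
  c(-1, 1) * t_star * coefs["wt", "Std. Error"]

ci_manual
confint(fit, "wt", level = conf_level)
```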

Exercise 7

  1. Compute the 98% confidence interval for the mean total cup points for coffees with aftertaste rating of 8.25. Interpret this interval in the context of the data.

  2. One coffee produced by the Ethiopia Commodity Exchange has an aftertaste of 8.25. Calculate the 98% prediction interval for the total cup points for this coffee. Interpret this interval in the context of the data.

  3. How do the intervals in parts (a) and (b) compare? If there are differences in the predictions and/or margin of error for the intervals, briefly explain why.
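
Both kinds of interval can be computed with predict(). The sketch below uses a stand-in model on the built-in mtcars data and a made-up predictor value (wt = 3), so it only illustrates the syntax. The names new_obs, ci_mean, and pi_new are assumptions.

```r
# Stand-in model on built-in data (not the lab data)
fit <- lm(mpg ~ wt, data = mtcars)

# Hypothetical predictor value at which to predict
new_obs <- data.frame(wt = 3)

# Interval for the mean response at wt = 3
ci_mean <- predict(fit, newdata = new_obs,
                   interval = "confidence", level = 0.98)

# Interval for the response of a single new observation at wt = 3
pi_new <- predict(fit, newdata = new_obs,
                  interval = "prediction", level = 0.98)

ci_mean
pi_new
```

Each result has the columns fit, lwr, and upr, which makes the two intervals easy to compare side by side.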

Exercise 8

We have conducted inference in the previous exercises making some inherent assumptions about the data. Therefore, we will check some model conditions to assess whether the assumptions hold, and thus evaluate the reliability of the inferential results. To do so, we will use the data frame coffee_aug from Exercise 4 that includes the residuals, predicted values, and other observation-level statistics from the model.

  1. Make a scatterplot of the residuals (.resid) vs. fitted values (.fitted). Use geom_hline() to add a horizontal dotted line at \(residuals = 0\).

Note

The linearity condition is satisfied if the residuals are randomly scattered (no distinguishable pattern or structure) in the plot of residuals vs. fitted values.

The equal variance condition is satisfied if the vertical spread of the residuals is relatively equal across the plot.

See Section 6.2.2 and Section 6.2.5 of the textbook for more on linearity and equal variance, respectively.

  2. Is the linearity condition satisfied? Briefly explain why or why not.

  3. Is the equal variance condition satisfied? Briefly explain why or why not.
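
A residuals vs. fitted values plot of the kind described in this exercise might be built as follows. This is a sketch on a stand-in model fit to the built-in mtcars data, assuming augment() from broom and ggplot2. The names fit_aug and resid_plot are hypothetical.

```r
library(ggplot2)
library(broom)  # augment(); also loaded by tidymodels

# Stand-in model on built-in data (not the lab data)
fit <- lm(mpg ~ wt, data = mtcars)
fit_aug <- augment(fit)

# Residuals vs. fitted values with a dotted reference line at 0
resid_plot <- ggplot(fit_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dotted") +
  labs(
    x = "Fitted values",
    y = "Residuals"
  )
resid_plot
```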

Exercise 9

Note

The normality condition is satisfied if the distribution of the residuals is approximately normal. This condition can be relaxed if the sample size is sufficiently large \((n > 30)\).

See Section 6.2.4 of the textbook for more on normality.

  1. Make a histogram or density plot of the residuals (.resid).

  2. Is the normality condition satisfied? Briefly explain why or why not. If not, can it be relaxed for this data set? Briefly explain.
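
For reference, a histogram of residuals can be sketched as below, again on a stand-in model fit to the built-in mtcars data. The choice of 10 bins and the names fit_aug and resid_hist are illustrative assumptions.

```r
library(ggplot2)
library(broom)  # augment(); also loaded by tidymodels

# Stand-in model on built-in data (not the lab data)
fit <- lm(mpg ~ wt, data = mtcars)
fit_aug <- augment(fit)

# Distribution of the residuals
resid_hist <- ggplot(fit_aug, aes(x = .resid)) +
  geom_histogram(bins = 10, color = "white") +
  labs(
    x = "Residual",
    y = "Count"
  )
resid_hist
```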

Exercise 10

Note

The independence condition means that knowing one residual will not provide information about another. We often check this by assessing whether the observations are independent based on what we know about the subject matter and how the data were collected.

This condition is sometimes difficult to fully assess, so we just want to consider whether it is reasonably satisfied based on the information and data available.

See Section 6.2.3 of the textbook for more on independence.

Is the independence condition satisfied for these data? Briefly explain why or why not.

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Reminder: you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

  • Access Gradescope through the STA 210 Canvas site.

  • Click on the assignment, and you’ll be prompted to submit it.

  • Mark the pages for the “Exercises” question.

  • Select the first page of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading (10pts)

This lab will be graded based on completion and workflow & formatting. The point breakdown is as follows:

Component               Points
Completion              8
Workflow & formatting   2

Completion: Points will be awarded for completion based on the following:

  • 8pts: Complete 10 exercises

  • 7pts: Complete 8 - 9 exercises

  • 6pts: Complete 6 - 7 exercises

  • 5pts: Complete 4 - 5 exercises

  • 4pts: Complete 2 - 3 exercises

  • 0pts: Complete < 2 exercises

Workflow & formatting

The workflow & formatting grade is to assess the reproducible workflow and document format. This includes having at least 3 informative commit messages, a neatly organized document with readable code, and your name and the date updated in the YAML.

  • 2pts: Meet all criteria

  • 1pt: Meet some criteria

  • 0pts: Meet no criteria