Lab 06: Logistic Regression

Important: Due date

This lab is due on Thursday, March 26 at 11:59pm. To be considered on time, the following must be done by the due date:

  • Final .qmd and .pdf files pushed to your GitHub repo
  • Final .pdf file submitted on Gradescope

Logistic Regression

In this lab, you will use logistic regression analysis to classify rice varieties based on features derived from image data. You will fit and interpret the logistic regression models using training data, then use testing data to assess how well the model classifies observations into the two varieties.

Learning goals

By the end of the lab you will be able to…

  • conduct exploratory data analysis for data with a binary response variable.

  • fit and interpret logistic regression models.

  • use relevant metrics and subject-matter context to identify a threshold for classification.

  • use training and testing data to fit and assess model performance.

Getting started

  • Click here to see your team for this week’s lab.

  • Go to the sta210-sp26 organization on GitHub. Click on the repo with the prefix lab-06-. It contains the starter documents you need to complete the lab.

  • Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo, starting a new project in R, and configuring git.

  • Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.

Packages

You will use the following packages in today’s lab. Add other packages as needed.

library(tidyverse)
library(tidymodels)
library(knitr)

Data

The dataset in this lab contains measures describing the shape and structure of two rice varieties: Cammeo and Osmancik. To curate the dataset, researchers used images of more than 3,000 grains of rice from these two varieties. They then used automated methods to process the image data and extract 7 morphological features (features related to the structure) from each image. The data were originally presented and analyzed in Cinar and Koklu (2019) and were obtained from the UCI Machine Learning Repository.

This analysis will focus on the following variables:

  • Class: Cammeo or Osmancik

  • Area: Size of the rice grain measured in pixels

  • Eccentricity: A measure of how round the ellipse is, i.e., how close the shape of the grain is to a circle.

Click here for the full data dictionary.

rice <- read_csv("data/rice.csv")

Exercises

Goal: Use Area and Eccentricity to distinguish grains of the Cammeo variety from grains of the Osmancik variety.

Exercise 1

Create a scatterplot of Area versus Eccentricity, such that the color and shape of the points are based on Class.

Exercise 2

Based on the plot from the previous exercise, do you think the two rice varieties can be distinguished based on Area and Eccentricity? Briefly explain.

Exercise 3

Split the data into training (75%) and testing (25%) sets. Use a seed of 210 for reproducibility.
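
As a rough sketch of the mechanics, the code below splits a small made-up data frame (not the rice data) into training and testing sets with functions from the rsample package, which is loaded by tidymodels. The names toy, toy_split, toy_train, and toy_test are hypothetical and used only for illustration.

```r
library(tidymodels)

# Toy data standing in for the lab data (hypothetical values, not the rice data)
toy <- tibble(
  x = seq(1, 100),
  y = factor(rep(c("a", "b"), times = 50))
)

set.seed(210)                                  # set the seed before any random step
toy_split <- initial_split(toy, prop = 0.75)   # 75% of rows go to training
toy_train <- training(toy_split)               # rows used to fit the model
toy_test  <- testing(toy_split)                # rows held out for evaluation

nrow(toy_train)
nrow(toy_test)
```

Setting the seed immediately before initial_split() is what makes the split reproducible each time the document is rendered.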

Exercise 4

In a logistic regression model, the log-odds of the response being “1” (a “success”) is given by \[\log\Big(\frac{\pi_i}{1-\pi_i}\Big) = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p\]

In this analysis, a “success” means the Class is “Osmancik”.

  1. What does each \(\pi_i\) represent in the context of this analysis?

  2. How is the probability of the response variable being “1” calculated from the log-odds? Show or explain the mathematical steps to go from the log-odds to the probability.
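
One way to check your algebra for the second part is numerically. The sketch below starts from a made-up probability, converts it to log-odds, and converts it back. The value 0.8 is hypothetical and has nothing to do with the rice data. The code does not replace the written steps the exercise asks for.

```r
# Hypothetical probability chosen only for illustration
p <- 0.8

odds     <- p / (1 - p)    # odds of a "success"
log_odds <- log(odds)      # log-odds (logit) of a "success"

# Reverse the two steps: exponentiate to recover the odds,
# then convert the odds back to a probability
odds_back <- exp(log_odds)
p_back    <- odds_back / (1 + odds_back)

# Base R's plogis() applies the same inverse-logit transformation
all.equal(p_back, p)
all.equal(plogis(log_odds), p)
```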

Exercise 5

  1. Use the training data to fit the logistic regression model for the response variable Class using the predictors Area and Eccentricity. Neatly display the output using 3 decimal places.

  2. Does the intercept have any meaningful interpretation in practice? If so, interpret the intercept. If not, explain why.
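
The sketch below shows one possible way to fit a logistic regression model and display the coefficients to 3 decimal places. It uses simulated data with a single predictor, not the rice data, and the names sim, x, group, and sim_fit are hypothetical. It uses glm() from base R; other approaches, such as the tidymodels interface, would work as well.

```r
library(tidymodels)
library(knitr)

# Simulated data (hypothetical, not the rice data): one predictor, two-level response
set.seed(1)
sim <- tibble(
  x = rnorm(200),
  group = factor(if_else(runif(200) < plogis(-0.5 + 1.5 * x), "B", "A"),
                 levels = c("A", "B"))
)

# For a factor response, glm() models the probability of the second level ("B")
levels(sim$group)

sim_fit <- glm(group ~ x, data = sim, family = binomial)

# Coefficients are on the log-odds scale; show them with 3 decimal places
tidy(sim_fit) |>
  kable(digits = 3)
```

Checking levels() before fitting is a quick way to confirm which category the model treats as the “success”.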

Exercise 6

Interpret the coefficients on Area and Eccentricity in the context of the data in terms of the odds.

Exercise 7

  1. How would you expect the log-odds of the rice grain being of the Osmancik variety to change if the eccentricity changes from 0.85 to 0.9?

  2. How would you expect the odds of the rice grain being of the Osmancik variety to change if the eccentricity changes from 0.85 to 0.9?
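
For a change in a predictor that is not exactly one unit, the change in the log-odds is the coefficient times the size of the change, and the odds are multiplied by the exponential of that amount. The sketch below illustrates this with a made-up coefficient and a made-up change. Both numbers are hypothetical and are not estimates from the rice data.

```r
# Hypothetical slope on the log-odds scale (not an estimate from the rice data)
beta_x  <- -2
delta_x <- 0.5   # hypothetical size of the change in the predictor

change_log_odds <- beta_x * delta_x        # additive change in the log-odds
odds_multiplier <- exp(change_log_odds)    # multiplicative change in the odds

change_log_odds   # -1
odds_multiplier   # exp(-1), about 0.368
```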

Exercise 8

Now let’s evaluate our model on the test set. Compute the predicted probabilities of the rice being Osmancik for the observations in the testing data.
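
The sketch below shows one possible way to compute predicted probabilities for held-out data, again using simulated data instead of the rice data. The names sim, sim_train, sim_test, sim_fit, and pred_prob are hypothetical. It assumes a model fit with glm(); if you fit your model another way, the prediction step will look different.

```r
library(tidymodels)

# Simulated data (hypothetical, not the rice data)
set.seed(1)
sim <- tibble(
  x = rnorm(200),
  group = factor(if_else(runif(200) < plogis(-0.5 + 1.5 * x), "B", "A"),
                 levels = c("A", "B"))
)
sim_split <- initial_split(sim, prop = 0.75)
sim_train <- training(sim_split)
sim_test  <- testing(sim_split)

# Fit on the training data only
sim_fit <- glm(group ~ x, data = sim_train, family = binomial)

# type = "response" returns probabilities of the second level ("B");
# the default for predict() on a glm is the log-odds scale
sim_test <- sim_test |>
  mutate(pred_prob = unname(predict(sim_fit, newdata = sim_test, type = "response")))

sim_test |>
  slice_head(n = 5)
```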

Exercise 9

With these estimated probabilities, we can now try to classify the rice in the test set. Choose a threshold for assigning a class to each observation based on the estimated probability. Briefly explain your reasoning for selecting the threshold, including any analysis used to make your decision.
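
One analysis that can inform the choice of threshold is to look at sensitivity and specificity across many candidate thresholds. The sketch below does this with roc_curve() from the yardstick package on simulated probabilities and classes. The data and the names sim_pred, pred_prob, truth, and sim_roc are hypothetical. This is only one of several reasonable approaches.

```r
library(tidymodels)

# Simulated predicted probabilities and true classes (hypothetical, not the rice data)
set.seed(1)
sim_pred <- tibble(
  pred_prob = runif(200),
  truth = factor(if_else(runif(200) < pred_prob, "B", "A"), levels = c("A", "B"))
)

# event_level = "second" because "B", the second factor level, is the "success"
sim_roc <- roc_curve(sim_pred, truth, pred_prob, event_level = "second")

# Each row gives the sensitivity and specificity at one candidate threshold
sim_roc |>
  filter(.threshold > 0.4, .threshold < 0.6) |>
  slice_head(n = 5)
```

Which trade-off between sensitivity and specificity is acceptable depends on the subject-matter context, so the table alone does not determine the threshold.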

Exercise 10

Compare the predicted class assignments based on the threshold from the previous exercise with the actual classes. Comment on the result and the model’s performance.
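
The sketch below shows one possible way to turn predicted probabilities into classes and compare them with the actual classes using a confusion matrix. It uses simulated data and a cutoff of 0.5 chosen only for illustration. The names sim_pred, pred_class, threshold, and sim_cm are hypothetical.

```r
library(tidymodels)

# Simulated predicted probabilities and true classes (hypothetical, not the rice data)
set.seed(1)
sim_pred <- tibble(
  pred_prob = runif(200),
  truth = factor(if_else(runif(200) < pred_prob, "B", "A"), levels = c("A", "B"))
)

threshold <- 0.5   # hypothetical cutoff, for illustration only

# Predicted class uses the same factor levels as the actual class
sim_pred <- sim_pred |>
  mutate(pred_class = factor(if_else(pred_prob > threshold, "B", "A"),
                             levels = c("A", "B")))

# Rows are predicted classes, columns are actual classes
sim_cm <- conf_mat(sim_pred, truth = truth, estimate = pred_class)
sim_cm

# Proportion of observations classified correctly
accuracy(sim_pred, truth = truth, estimate = pred_class)
```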

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Reminder: you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit the assignment:

  • Access Gradescope through the STA 210 Canvas site.

  • Click on the assignment, and you’ll be prompted to submit it.

  • Select the name of every team member who contributed to the assignment (if applicable).

  • Mark the pages for the “Exercises” question.

  • Mark the first page for “Workflow & formatting”.

Grading (10 pts)

This lab will be graded based on completion and workflow & formatting. The point breakdown is as follows:

Component               Points
Complete exercises      8
Workflow & formatting   2

Completion: Points will be awarded for completion based on the following:

  • 8 pts: Complete all exercises

  • 7 pts: Complete 8 - 9 exercises

  • 6 pts: Complete 6 - 7 exercises

  • 5 pts: Complete 4 - 5 exercises

  • 4 pts: Complete 2 - 3 exercises

  • 0 pts: Complete < 2 exercises

Workflow & formatting

The “Workflow & formatting” grade assesses your reproducible workflow and collaboration. This includes having at least one meaningful commit from each team member (if applicable) and updating the name and date in the YAML.

  • 2 pts: Meet all criteria

  • 1 pt: Meet some criteria

  • 0 pts: Meet no criteria

References

Cinar, Ilkay, and Murat Koklu. 2019. “Classification of Rice Varieties Using Artificial Intelligence Methods.” International Journal of Intelligent Systems and Applications in Engineering 7 (3): 188–94.