Lab 07: Exam 02 review
This lab is due on Tuesday, April 7 at 11:59pm. To be considered on time, the following must be done by the due date:
- Final .qmd and .pdf files pushed to your GitHub repo
- Final .pdf file submitted on Gradescope
Exam 02 review
The goal of this lab is to review for Exam 02. There are some topics eligible for Exam 02 that are not covered in this review.
Getting started
Click here to see your team for this week’s lab.
Go to the sta210-sp26 organization on GitHub. Click on the repo with the prefix lab-07-. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo, starting a new project in R, and configuring git.
Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.
Packages
library(tidyverse)
library(tidymodels)
library(knitr)
library(pROC)
Data: Credit cards
The data for this analysis is about credit card customers. It can be found in the file credit.csv. The following variables are in the data set:
- income: Income in $1,000’s
- limit: Credit limit
- rating: Credit rating
- cards: Number of credit cards
- age: Age in years
- education: Number of years of education
- own: A factor with levels No and Yes indicating whether the individual owns their home
- student: A factor with levels No and Yes indicating whether the individual was a student
- married: A factor with levels No and Yes indicating whether the individual was married
- region: A factor with levels South, East, and West indicating the region of the US the individual is from
- balance: Average credit card balance in $
The objective of this analysis is to predict whether a person has maxed out their credit card, i.e., had an average card balance of $0.
Exercise 1
Create the response variable called maxed that takes the value 1 if balance == 0 and 0 otherwise.
Why is logistic regression the best modeling approach for this analysis?
Describe where each of the following shows up in the analysis:
log-odds …
odds
probabilities
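As a minimal sketch of the two ideas in this exercise, the base R code below builds an indicator variable and moves a single value between the probability, odds, and log-odds scales. The data frame `toy` and the probability 0.25 are hypothetical values chosen for illustration; they are not taken from credit.csv.

```r
# Toy data: hypothetical balances, NOT values from credit.csv
toy <- data.frame(balance = c(0, 250, 0, 1200, 75))

# Indicator response: 1 if balance == 0, 0 otherwise
toy$maxed <- ifelse(toy$balance == 0, 1, 0)

# One hypothetical predicted probability, expressed on all three scales
p        <- 0.25
odds     <- p / (1 - p)   # probability -> odds
log_odds <- log(odds)     # odds -> log-odds (the scale of the linear predictor)

# Going back: log-odds -> odds -> probability
odds_back <- exp(log_odds)
p_back    <- odds_back / (1 + odds_back)
```

In a logistic regression, the coefficients describe changes in the log-odds, exponentiated coefficients describe multiplicative changes in the odds, and predictions are usually reported as probabilities.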
Exercise 2
Start by splitting the data into training (80%) and testing (20%) sets. Use seed 210. Then fit the model to predict the odds of maxed = 1 using income, rating, and region.
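The split-then-fit workflow can be sketched as follows, assuming the tidymodels functions `initial_split()`, `training()`, `testing()`, and `logistic_reg()`. To keep the example self-contained it uses a simulated data frame, `toy`, that only borrows the column names from credit.csv; the simulated values, the object names, and the coefficients used to generate them are all assumptions made for illustration.

```r
library(tidymodels)

set.seed(210)  # the seed named in the exercise

# Simulated stand-in that reuses the credit.csv column names (hypothetical values)
n <- 200
toy <- tibble(
  income = runif(n, 10, 180),
  rating = round(runif(n, 100, 900)),
  region = factor(sample(c("South", "East", "West"), n, replace = TRUE)),
  # tidymodels classification models expect a factor response
  maxed  = factor(rbinom(n, 1, plogis(2 - 0.01 * rating)))
)

# 80% training / 20% testing
toy_split <- initial_split(toy, prop = 0.8)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Logistic regression fit on the training data only
maxed_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(maxed ~ income + rating + region, data = toy_train)

tidy(maxed_fit)  # coefficients are on the log-odds scale
```

With the "glm" engine and a two-level factor response, the model describes the second factor level (here "1"), so it is worth checking the level order before interpreting coefficients.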
Exercise 3
Use the model from the previous exercise.
Write the interpretation of income in terms of the odds of maxing out a credit card.
Show the expected change in the odds of maxing out a credit card when the credit rating increases by 10 points. Assume income and region are constant.
Suppose there are two individuals. Individual 1 has an income of $64,000, a credit rating of 590, and is from the South region. Individual 2 has an income of $135,000, a credit rating of 695, and is from the East region. Show how the odds of maxing out a credit card differ between Individual 1 and Individual 2.
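One way to organize these calculations, as a sketch: write the fitted model on the log-odds scale. The symbols below are notation introduced here, not notation from the lab: \(\hat{\beta}_1\) and \(\hat{\beta}_2\) are the estimated coefficients for income and rating, and \(\hat{\gamma}_{r}\) is the estimated effect of region \(r\), equal to 0 for whichever region is the baseline level in your fitted model.

\[
\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = \hat{\beta}_0 + \hat{\beta}_1\,\text{income} + \hat{\beta}_2\,\text{rating} + \hat{\gamma}_{\text{region}}
\]

Holding income and region constant, a 10-point increase in rating multiplies the odds by

\[
\frac{\text{odds}(\text{rating} + 10)}{\text{odds}(\text{rating})} = e^{10\,\hat{\beta}_2}
\]

For the two individuals, the intercept cancels and, because income is recorded in $1,000’s, the ratio of their odds is

\[
\frac{\text{odds}_1}{\text{odds}_2} = \exp\left\{\hat{\beta}_1(64 - 135) + \hat{\beta}_2(590 - 695) + \hat{\gamma}_{\text{South}} - \hat{\gamma}_{\text{East}}\right\}
\]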
Exercise 4
We consider adding the interaction between region and income to the current model. We’ll use a drop-in-deviance test to determine whether or not to add the interaction term.
- State the null and alternative hypotheses in words and using mathematical notation.
- Describe what the test statistic \(G\) means in the context of the data.
- Show why the degrees of freedom for the test statistic are equal to 2.
- Conduct the drop-in-deviance test and state your conclusion in the context of the data.
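The mechanics of a drop-in-deviance test can be sketched with base R's `glm()` and `anova()`. The data below are simulated with an arbitrary seed and reuse only the column names from credit.csv, so the resulting statistic and p-value say nothing about the real data. The degrees of freedom equal the number of coefficients added: a three-level region contributes two indicator variables, so the income-by-region interaction adds two terms.

```r
set.seed(1)  # arbitrary seed for the simulated data below

# Simulated stand-in that reuses the credit.csv column names (hypothetical values)
n <- 300
toy <- data.frame(
  income = runif(n, 10, 180),
  rating = round(runif(n, 100, 900)),
  region = factor(sample(c("South", "East", "West"), n, replace = TRUE))
)
toy$maxed <- rbinom(n, 1, plogis(2 - 0.01 * toy$rating))

# Reduced model (no interaction) and full model (with interaction)
reduced <- glm(maxed ~ income + rating + region,
               data = toy, family = binomial)
full    <- glm(maxed ~ income + rating + region + income:region,
               data = toy, family = binomial)

# Test statistic: how much the deviance drops when the interaction is added
G       <- deviance(reduced) - deviance(full)
df      <- df.residual(reduced) - df.residual(full)  # number of added coefficients
p_value <- pchisq(G, df = df, lower.tail = FALSE)

# The same test in one call
anova(reduced, full, test = "Chisq")
```

A large \(G\) relative to a chi-square distribution with `df` degrees of freedom suggests that the interaction terms improve the fit beyond what would be expected by chance.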
Exercise 5
Now let’s evaluate the performance of the selected model using the testing data.
Create a confusion matrix using a cutoff probability of 0.3.
What is the sensitivity? What does it mean in the context of the data?
What is the specificity? What does it mean in the context of the data?
What is the false positive rate? What does it mean in the context of the data?
What is the false negative rate? What does it mean in the context of the data?
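The four rates above all come from the cells of the confusion matrix. The base R sketch below computes them for ten hypothetical observations; the vectors `truth` and `prob` are made up for illustration and are not model output. It also assumes that a probability exactly equal to the cutoff is classified as 1, a convention the lab does not specify.

```r
# Hypothetical test-set outcomes and predicted probabilities (illustration only)
truth <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
prob  <- c(0.80, 0.10, 0.35, 0.25, 0.05, 0.60, 0.40, 0.20, 0.45, 0.15)

# Classify as 1 when the predicted probability is at least the cutoff
cutoff <- 0.3
pred   <- ifelse(prob >= cutoff, 1, 0)

# Confusion matrix: rows are predictions, columns are actual outcomes
table(predicted = pred, actual = truth)

# Cell counts
tp <- sum(pred == 1 & truth == 1)  # true positives
fp <- sum(pred == 1 & truth == 0)  # false positives
tn <- sum(pred == 0 & truth == 0)  # true negatives
fn <- sum(pred == 0 & truth == 1)  # false negatives

sensitivity <- tp / (tp + fn)  # P(predict 1 | actual 1)
specificity <- tn / (tn + fp)  # P(predict 0 | actual 0)
fpr         <- fp / (fp + tn)  # 1 - specificity
fnr         <- fn / (fn + tp)  # 1 - sensitivity
```

Note that the false positive rate and specificity always sum to 1, as do the false negative rate and sensitivity.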
Exercise 6
Produce the ROC curve.
Describe how you can use this curve to select a cutoff probability (rather than just going with 0.5).
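A minimal sketch of producing an ROC curve with the pROC package follows, using hypothetical outcomes and probabilities rather than model output. The last step picks the threshold that maximizes Youden's J (sensitivity + specificity - 1). That is only one possible criterion and is not prescribed by the lab; the right cutoff depends on the relative costs of false positives and false negatives.

```r
library(pROC)

# Hypothetical test-set outcomes and predicted probabilities (illustration only)
truth <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
prob  <- c(0.80, 0.10, 0.35, 0.25, 0.05, 0.60, 0.40, 0.20, 0.45, 0.15)

# levels = c(control, case); direction "<" means controls have lower values
roc_obj <- roc(response = truth, predictor = prob,
               levels = c(0, 1), direction = "<")

plot(roc_obj)  # note: pROC puts specificity on a reversed x-axis
auc(roc_obj)   # area under the curve

# One possible rule for choosing a cutoff: maximize Youden's J
coords(roc_obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```

Each point on the curve corresponds to one cutoff, so reading along the curve shows how much specificity is given up for each gain in sensitivity.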
Exercise 7
Describe one ethical consideration for this analysis.
Submission
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Reminder: you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
To submit the assignment:
Access Gradescope through the STA 210 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Select the name of every team member who contributed to the assignment (if applicable).
Mark the pages for the “Exercises” question.
Mark the first page for “Workflow & formatting”.
Grading (10 pts)
This lab will be graded based on completion and workflow & formatting. The point breakdown is as follows:
| Component | Points |
|---|---|
| Complete exercises | 8 |
| Workflow & formatting | 2 |
8 pts: Complete all exercises
7 pts: Complete 6 exercises
6 pts: Complete 5 exercises
5 pts: Complete 4 exercises
4 pts: Complete 3 exercises
0 pts: Complete < 3 exercises
Workflow & formatting
The “Workflow & formatting” grade assesses the reproducible workflow and collaboration. This includes having at least one meaningful commit from each team member (if applicable) and updating the name and date in the YAML.
2 pts: Meet all criteria
1 pt: Meet some criteria
0 pts: Meet no criteria