Final project – STA 210

Project milestones

Project proposal due Thursday, February 26

Exploratory data analysis due Tuesday, March 24

Preliminary analysis due Tuesday, April 7

Presentation + Presentation comments due Tuesday, April 14 and Thursday, April 16

Draft report + Peer review due Monday, April 20 (draft due before lab)

Written report due Monday, April 27

Repo organization due Monday, April 27

Introduction

The goal of the final project is for you to use regression to analyze data and explore a research question of your choice. The data may be from an existing data set or you may collect your own data from the internet.

Choose the data based on your group’s interests or work you all have done in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques covered in this course (and beyond, if you choose) by applying them to a data set in a meaningful and rigorous way.

Important

All analyses must be done in RStudio using Quarto and GitHub, and your analysis and written report must be reproducible.

Logistics

You may complete the project individually or in pairs. The primary deliverables for the project are the following:

An in-person presentation about your analysis and results
A written, reproducible final report detailing your analysis and results
A GitHub repository containing all work from the project

There are intermediate milestones and peer review assignments throughout the semester to help you work towards the primary deliverables. These milestones are described below.

Project proposal

Submission

Write your narrative and analysis for Sections 1 - 4 in the proposal.qmd file in your project GitHub repo. Put the data set and the data dictionary in the data folder in the repo. Push the .qmd and rendered .pdf documents to GitHub by the deadline, Thursday, February 26 at 11:59pm.

There is no Gradescope submission.

The purpose of the project proposal is for you to identify the data set you’re interested in analyzing to investigate one of your potential research topics. You will also do some preliminary exploration of the response variable and begin thinking about the modeling strategy. If you’re unsure where to find data, you can use the list of potential data sources on the Tips + resources page as a starting point.

Important

You must use the data set(s) in the proposal for the final project, unless instructed otherwise when given feedback.

The data set must meet the following criteria:

At least 500 observations
At least 10 columns, with at least 6 of the columns being useful and unique predictor variables.
- e.g., identifier variables such as “name”, “ID number”, etc. are not useful predictor variables.
- e.g., if you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique predictors.
At least one variable that can be identified as a reasonable response variable.
- The response variable can be quantitative or categorical.
A mix of quantitative and categorical variables that can be used as predictors.
The data set may not be one that has been used in this course or in other course materials, nor any derivation of such data.
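As a quick self-check before submitting the proposal, the countable criteria above can be verified in R. The sketch below is illustrative only: `project_data` and all of its columns are hypothetical, simulated stand-ins for your own data. The checks cover only the number of rows, the number of columns, and the mix of variable types. Whether predictors are useful and unique still requires your own judgment.

```r
# Simulated stand-in for a project data set (hypothetical names and values).
set.seed(210)
n <- 600
project_data <- data.frame(
  id      = seq_len(n),   # identifier column: not a useful predictor
  outcome = rnorm(n),     # candidate response variable
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n),
  region  = factor(sample(c("N", "S", "E", "W"), n, replace = TRUE)),
  group   = factor(sample(c("a", "b"), n, replace = TRUE)),
  x5 = runif(n), x6 = runif(n)
)

# Countable criteria from the list above.
enough_rows <- nrow(project_data) >= 500
enough_cols <- ncol(project_data) >= 10

# Identify quantitative and categorical columns to confirm there is a mix.
is_quant <- vapply(project_data, is.numeric, logical(1))
is_categ <- vapply(
  project_data,
  function(col) is.factor(col) || is.character(col),
  logical(1)
)
has_mix <- any(is_quant) && any(is_categ)

checks <- c(rows = enough_rows, columns = enough_cols, mix = has_mix)
print(checks)
```

Replace the simulated data frame with your own data (for example, a file read from the data folder) and confirm that every check is TRUE.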

Types of data sets to avoid

Data that are likely to violate the independence condition (e.g., repeated measures on the same subjects, longitudinal/time-series data, etc.).
Data sets in which there is no information about how the data were originally collected
Data sets in which there are missing or unclear definitions about the observations and/or variables

Ask a member of the teaching team if you’re unsure whether your data set meets the criteria.

The proposal will include the following sections:

Section 1: Introduction

An introduction to the subject matter you’re investigating (citing any relevant literature)
Statement of a well-developed research question.
The motivation for your research question and why it is important
Your hypotheses regarding the research question
- This is a narrative about what you think regarding the research question, not formal statistical hypotheses.

Section 2: Data description

The source of the data set
A description of when and how the data were originally collected (by the original data curator, not necessarily how you found the data)
A description of the observations and general characteristics being measured

Section 3: Data processing

Description of data processing you need to do to prepare for analysis, such as joining multiple data sets, handling missing data, etc.

Section 4: Analysis approach

Visualizations, summary statistics, and narrative to describe the distribution of the response variable.
A description of the potential predictor variables of interest
Regression model technique (multiple linear regression for a quantitative response variable or logistic regression for a categorical response variable)
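To make the choice of technique concrete, here is a minimal sketch of the two model types using base R's `lm()` and `glm()`. The data are simulated, and the variable names (`hours`, `level`, `score`, `passed`) are hypothetical. It is meant only to show how the type of response variable maps to the model call, not to suggest a modeling strategy for your data.

```r
# Simulated data for illustration only (hypothetical variable names).
set.seed(210)
n <- 500
sim <- data.frame(
  hours = runif(n, 0, 10),
  level = factor(sample(c("low", "high"), n, replace = TRUE))
)

# Quantitative response: linear in hours plus random noise.
sim$score <- 50 + 3 * sim$hours + rnorm(n, sd = 5)

# Binary categorical response: success probability increases with hours.
sim$passed <- factor(
  rbinom(n, size = 1, prob = plogis(-2 + 0.4 * sim$hours)),
  labels = c("no", "yes")
)

# Quantitative response -> multiple linear regression.
fit_linear <- lm(score ~ hours + level, data = sim)

# Binary categorical response -> logistic regression.
fit_logistic <- glm(passed ~ hours + level, data = sim, family = binomial)

print(summary(fit_linear)$coefficients)
print(summary(fit_logistic)$coefficients)
```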

Data dictionary (aka code book)

Submit a data dictionary for all the variables in your data set in the README of the data folder. You do not need to include the data dictionary in the PDF document.

Grading

The anticipated length, including all graphs, tables, narrative, etc., is 2 - 4 pages.

The proposal is worth 10 points and will be graded based on accurately and comprehensively addressing the criteria stated above. Points will be assigned based on a holistic review of the project proposal.

Excellent (10 pts): All required elements are completed and are accurate. The data set meets the requirements (or the student has otherwise discussed the data with Professor Tackett) and the data do not pose obvious violations to the modeling assumptions. There is a thoughtful and comprehensive description of the data, any data processing, and exploration of the response variable as described above. The narrative is written clearly, all tables and visualizations are nicely formatted, and the work would be presentable in a professional setting.
Strong (8 - 9 pts): Requirements are mostly met, but there are some elements that are incomplete or inaccurate. Some minor revision of the work required before student is ready for modeling.
Satisfactory (6 - 7 pts): Requirements partially met, but there are some elements that are incomplete and/or inaccurate. Major revision of the work required before student is ready for modeling.
Needs Improvement (4 - 5 pts): Requirements are largely unmet, and there are major elements that are incomplete and/or inaccurate. Substantial revisions of the work required before student is ready for modeling.

Exploratory data analysis

Submission

Write your draft introduction and exploratory data analysis in the written-report.qmd file in your GitHub repo. Push the .qmd and rendered .pdf documents to GitHub by Tuesday, March 24 at 11:59pm.

The purpose of this milestone is to begin exploring the data and get early feedback on your data and analysis. You will submit a draft of the early sections of the report that includes the introduction and exploratory data analysis, with an emphasis on the EDA. It will also help you prepare for the presentation of the exploratory data analysis results.

Below is a brief description of the sections to include in this step:

Introduction

This section includes an introduction to the project motivation, background, data, and research question. Additionally, comment on any known limitations, missingness, or measurement concerns that may impact the interpretation of your analysis and results.

Tip

Reuse and iterate on the work from the previous milestones.

Exploratory data analysis

This section includes the following:

Description of the data set and key variables.
Exploratory data analysis of the response variable and key predictor variables. This includes visualizations, summary statistics, and narrative. Focus on 4 - 6 key predictors that are most relevant to your research question. You may explore additional variables, as appropriate, but prioritize depth over breadth.
- Univariate EDA of the response and key predictor variables
- Bivariate EDA of the response and key predictor variables
- Visualizations exploring at least one potential interaction effect

For each visualization and/or set of summary statistics, briefly describe what you learn about the distribution of the variable or relationship between variables. Additionally, comment on how the observations from the EDA may inform your modeling decisions (e.g., use of transformations, quadratic effects, interaction effects, etc.).
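The three kinds of EDA listed above can be sketched as follows. This example uses only base R graphics and simulated data; the variable names are hypothetical, and the response is deliberately built so that its slope differs by group. Any plotting package you prefer would work equally well.

```r
# Simulated data for illustration only (hypothetical variable names).
set.seed(210)
n <- 500
eda <- data.frame(
  hours = runif(n, 0, 10),
  level = factor(sample(c("low", "high"), n, replace = TRUE))
)

# Build a response whose slope in hours differs by level (an interaction).
slope <- ifelse(eda$level == "high", 4, 2)
eda$score <- 50 + slope * eda$hours + rnorm(n, sd = 5)

# Univariate EDA: summary statistics and a histogram of the response.
print(summary(eda$score))
hist(eda$score, main = "Distribution of score", xlab = "Score")

# Bivariate EDA: response vs. a quantitative and a categorical predictor.
plot(score ~ hours, data = eda, main = "Score vs. hours")
boxplot(score ~ level, data = eda, main = "Score by level")

# Potential interaction: one fitted line per level on the same scatterplot.
plot(score ~ hours, data = eda, col = as.integer(eda$level),
     main = "Score vs. hours, by level")
for (lv in levels(eda$level)) {
  abline(lm(score ~ hours, data = eda[eda$level == lv, ]),
         col = which(levels(eda$level) == lv))
}
legend("topleft", legend = levels(eda$level),
       col = seq_along(levels(eda$level)), lty = 1)

# Group-wise slopes give a numeric companion to the interaction plot.
slopes <- sapply(
  split(eda, eda$level),
  function(d) coef(lm(score ~ hours, data = d))[["hours"]]
)
print(slopes)
```

Clearly different slopes across groups, as in this simulated example, are the kind of observation that could motivate including an interaction term in the model.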

Grading

The anticipated length with code, warnings, and messages suppressed in the rendered pdf is about 5 - 7 pages. It is OK to be over this page limit at this stage in the project.

Tip

Include the code below in the YAML to suppress code, messages, and warnings in the rendered pdf.

execute:
  echo: false
  message: false
  warning: false

The exploratory data analysis is worth 10 points and will be graded based on accurately and comprehensively addressing the criteria stated above, along with incorporating the feedback from the proposal. Points will be assigned based on a holistic review of the exploratory data analysis.

Excellent (8 - 10 points): All required elements are completed and are accurate. There is a thorough exploration of the data as described above, and the student has demonstrated a careful and thoughtful approach exploring the data and preparing it for analysis. The narrative is written clearly, all tables and visualizations are nicely formatted, and the work would be presentable in a professional setting.
Strong (6 - 7 points): Requirements are mostly met, but there are some elements that are incomplete or inaccurate. Minor gaps or areas needing clarification, but the analysis is sound overall and will be ready for modeling with minor revision.
Satisfactory (4 - 5 points): Requirements partially met, but there are some elements that are incomplete and/or inaccurate. This may include multiple gaps in the analysis, unclear interpretations, or limited connection between EDA and modeling decisions. Major revision is required before the student is ready for modeling.
Needs Improvement (3 points or fewer): Requirements are largely unmet, and there are large elements that are incomplete and/or inaccurate. Substantial revisions of the work required before the student is ready for modeling.

Preliminary analysis

Submission

Write your draft analysis in the written-report.qmd file in your GitHub repo. Push the .qmd and rendered .pdf documents to GitHub by April 7 at 11:59pm.

The purpose of the preliminary analysis (also referred to below as the draft analysis) is to get early feedback on your modeling approach. Therefore, this project milestone focuses on the eventual Methods and Results sections of the final report.

Introduction

This section includes an introduction to the project motivation, data, and research question. Include discussion about work others have done related to your research question. Describe the data and definitions of key variables. It should also include selected exploratory data analysis that helps motivate the modeling decisions described later in the report. Not all of the EDA will fit in the body of the report, so focus on the EDA for the response variable and a few other interesting variables and relationships.

This section is not the primary focus of the draft analysis, but you are encouraged to continue revising the introduction based on the feedback from the exploratory data analysis milestone.

Methodology

This section describes your modeling approach and the process used to select the final model. Explain the reasoning for the type of model you’re fitting and the predictor variables considered for the model, including any interactions. Additionally, show how you arrived at the model you are currently considering as the final model by describing the model selection process, any variable transformations (if needed), and any other relevant considerations that were part of the model fitting process.

This section should focus on the modeling process rather than the numerical results.

Results

In this section, you will present the final model and include a brief discussion of the model assumptions, diagnostics, and any relevant model fit statistics.

This section also includes initial interpretations of key model coefficients and discussion of inferential conclusions drawn from the model. The goal is not to interpret every single variable in the model but rather to show that you are proficient in using the model output to address the research question, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.

Grading

The draft analysis is worth 10 points and will be graded based on accurately and comprehensively addressing the criteria stated above, along with incorporating the feedback from previous milestones.

Presentation

Submission

Presentations will take place during lecture on April 14 and April 16. The slides must be submitted by the beginning of lecture on April 14. We will use the classroom computer for presentations.

The slides may be submitted in one of the following ways:

Put a PDF of the slides or Quarto slides in the presentation folder in your GitHub repo.
Put the URL to your slides in the README of the presentation folder. If you share the URL, please make sure permissions are set so Prof. Tackett can view the slides.

There is no make-up work for the presentation.

You will do an in-person presentation that summarizes and showcases the work you’ve done on the project. It will also be an opportunity to receive feedback on your project that can be incorporated in the final written report. Through the presentation, you will introduce the subject matter and research question, showcase key results from the exploratory data analysis and modeling, interpret key predictors, and present initial conclusions from the analysis. The presentation should be supported by slides that serve as a brief visual aid to the presentation. The presentation and slides will be graded for content accuracy and clarity.

You can create your slides with any software you like (e.g., Keynote, PowerPoint, Google Slides, etc.). You can also use Quarto to make your slides! While we will not cover making slides with Quarto in class, the teaching team can help you during office hours. It’s no different than writing other documents with Quarto, so the learning curve will not be steep! You can learn more about making slides in Quarto here: https://quarto.org/docs/presentations/revealjs/

The presentation is expected to be between 4 and 5 minutes. It may not exceed 5 minutes, to ensure everyone has the opportunity to present.

Slides

Use no more than 6 content slides + 1 title slide to ensure you have enough time to discuss each slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.

Title Slide
Slide 1: Introduce the subject, motivation, and research question
Slide 2: Introduce the data set
Slides 3 - 4: Highlights from the EDA (include EDA on the response variable and focus on the most important patterns or relationships that may motivate your modeling approach)
Slide 5: Current model and key predictors
Slide 6: Conclusions drawn from model

Grading

The presentation is worth 15 points. It will be graded based on the following:

Content: The student told a unified story that clearly introduced the subject matter, research question, and analysis of the data.
Slides: The presentation slides were organized, included clear and informative visualizations, and were easily readable.
Presentation: The student’s communication style was clear and professional, and the presentation stayed within the time limit.

80% of the presentation grade will be the average of the teaching team scores, and 20% will be the average of the peer scores.
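As a worked illustration of this weighting, suppose (hypothetically) that the teaching team scores average 14 and the peer scores average 12.5. The sketch assumes both averages are on the same 15-point scale, which is not stated above. The `\text` command requires the amsmath package.

```latex
\[
\text{presentation grade}
  = 0.8 \times \bar{s}_{\text{team}} + 0.2 \times \bar{s}_{\text{peer}}
\]
\[
0.8 \times 14 + 0.2 \times 12.5 = 11.2 + 2.5 = 13.7 \text{ out of } 15
\]
```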

Presentation comments

Submission

Presentation comments must be submitted through the feedback form by the end of the lecture.

See the Canvas announcement the day before presentations for your feedback assignments and link to the feedback form.

There is no make-up work for the presentation comments.

You will provide feedback on two presentations on the day you are not presenting. There will be a few minutes between presentations to submit scores and comments.

Grading

The presentation comments are worth 2 points. The grade will be based on submitting the scores and comments for both of your assigned presentations by the end of the lecture.

Draft report

Submission

Write your draft report in the written-report.qmd file in your GitHub repo. Push the .qmd and rendered .pdf documents to GitHub before the start of your lab section on April 20.

The purpose of the draft and peer review is to give you an opportunity to receive feedback on your written report before the final submission.

The draft report will include all required sections of the final written report. There is no page limit for the draft, but keep in mind that the final written report has a strict 10-page limit. For the draft and final written report, additional EDA or supporting analyses can be included in an appendix if needed. The appendix does not count towards the 10-page limit.

Introduction

This section includes an introduction to the project motivation, data, and research question. Include discussion about work others have done related to your research question. Describe the data and definitions of key variables. It should also include selected exploratory data analysis that helps motivate the modeling decisions described later in the report. Not all of the EDA will fit in the body of the report, so focus on the EDA for the response variable and a few other interesting variables and relationships.

Methodology

This section describes your modeling approach and the process used to select the final model. Explain the reasoning for the type of model you’re fitting and the predictor variables considered for the model, including any interactions. Additionally, show how you arrived at the model you are currently considering as the final model by describing the model selection process, any variable transformations (if needed), and any other relevant considerations that were part of the model fitting process.

This section should focus on the modeling process rather than the numerical results.

Results

In this section, you will present the final model and include a brief discussion of the model assumptions, diagnostics, and any relevant model fit statistics.

This section also includes initial interpretations of key model coefficients and discussion of inferential conclusions drawn from the model. The goal is not to interpret every single variable in the model but rather to show that you are proficient in using the model output to address the research question, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.

Discussion

This section is a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. Connect the conclusions drawn from your analysis to previous research, and discuss how they align or differ. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.

Grading

The draft will be graded based on whether there is demonstration of a reasonable and complete attempt at each section described above in written-report.qmd and the rendered PDF file in your GitHub repo by the deadline.

Peer review

Submission

Write the peer review in the Peer Review issue in your assigned GitHub repo by April 21 at 11:59pm.

Critically reviewing others’ work is a crucial part of the scientific process, and STA 210 is no exception. Everyone will be assigned two other projects to review. Everyone should push their draft to their GitHub repo before the start of their lab session on the day of the peer review lab. During lab, you will have the opportunity to review and provide quality feedback on one other project.

During the peer review process, you will be provided read-only access to the assigned GitHub repo. Provide your review in the form of a GitHub issue to the repo using the provided peer review issue template.

Steps for peer review

Peer review assignments

Go to the Canvas announcement to see which project you are reviewing. You’ll spend about 30 minutes reviewing the project.

When you get to lab, you should have access to the GitHub repo for the project you’re reviewing. In GitHub, search the repositories for “project”, and you should see the repo for the project you’re reviewing. You will be able to read the files in the repo and post issues, but you cannot push changes to the repo. You will have access to the repo until the deadline for the peer review.
Your feedback should be specific and constructive. Avoid brief comments such as “looks good.” Instead, explain what is effective and suggest concrete improvements where appropriate.
For the project you’re reviewing:
- Open the repo, read the project draft, and browse the rest of the repo.
- Go to the Issues tab in that repo, click on New issue, and click on Get started for the Peer Review issue. Write your responses to the prompts in the issue. You will answer the following questions:
  - Describe the goal of the project.
  - Describe the data set used in the project. What are the observations in the data? What is the source of the data? How were the data originally collected?
  - Consider the exploratory data analysis (EDA). Describe one aspect of the EDA that is effective in helping you understand the data. Provide constructive feedback on how the student might improve the EDA.
  - Describe the statistical methods, analysis approach, and discussion of model assumptions, diagnostics, and model fit.
  - Provide constructive feedback on how the student might improve their analysis. Make sure your feedback includes at least one comment on the statistical modeling aspect of the project, but also feel free to comment on aspects beyond the modeling.
  - Provide constructive feedback on the interpretations and initial conclusions. What is most effective in the presentation of the results? What additional detail can the student provide to make the results and conclusions easier for the reader to understand?
  - What aspect of this project are you most interested in and think would be interesting to highlight more (or include) in the written report?
  - Provide constructive feedback on any issues with file and/or code organization.
  - (Optional) Any further comments or feedback?

Grading

The peer review will be graded based on the extent to which it comprehensively and constructively addresses the components on the peer review form.

Written report

Important

The written report must be completed in the written-report.qmd file and must be reproducible. Make sure all code, warnings, and messages are suppressed in the final written report. The report in written-report.qmd and the rendered PDF should be pushed to the GitHub repo by April 27 at 11:59pm.

Below are general details about the final written report.

The rendered PDF must match exactly what is produced when rendering written-report.qmd in your GitHub repo.
The report, including tables and visualizations, must be no more than 10 pages long. There is no minimum page requirement; however, you should comprehensively address the analysis and clearly communicate your findings.
Be selective in what you include in your final write-up. The goal is to write a cohesive narrative that demonstrates a thorough and comprehensive analysis rather than explain every step of the analysis.
You are welcome to include an appendix with additional work at the end of the written report document, as needed; however, grading will overwhelmingly be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. The appendix does not count towards the 10-page limit.

The mandatory components of the written report are below. You are free to add additional sections as necessary.

Introduction

This section includes an introduction to the project motivation, data, and research question. Include discussion about work others have done related to your research question. Describe the data and definitions of key variables. It should also include selected exploratory data analysis that helps motivate the modeling decisions described later in the report. Not all of the EDA will fit in the body of the report, so focus on the EDA for the response variable and a few other interesting variables and relationships.

Grading criteria

The research question and motivation are clearly stated in the introduction, including citations for the data source and any external research. The data are clearly described, including a description of how the data were originally collected and a concise definition of the variables relevant to understanding the report. The data cleaning process is clearly described, including any decisions made in the process (e.g., creating new variables, removing observations, etc.). The exploratory data analysis helps the reader better understand the observations in the data along with interesting and relevant relationships between the variables. It incorporates appropriate visualizations and summary statistics.

Methodology

This section describes your modeling approach and the process used to select the final model. Explain the reasoning for the type of model you’re fitting and the predictor variables considered for the model, including any interactions. Additionally, show how you arrived at your final model by describing the model selection process, any variable transformations (if needed), and any other relevant considerations that were part of the model fitting process.

This section should focus on the modeling process rather than the numerical results.

Grading criteria

The analysis steps are appropriate for the data and research question. A thorough and careful approach was used to select the variables in the final model; the approach is clearly described in the report. The model selection process took into account potential interaction effects and addressed any violations of model assumptions. If violations of model assumptions are still present, there was a reasonable attempt to address the violations based on the course content.

Results

In this section, you will present the final model and include a brief discussion of the model assumptions, diagnostics, and any relevant model fit statistics.

This section also includes initial interpretations of key model coefficients and discussion of inferential conclusions drawn from the model. The goal is not to interpret every single variable in the model but rather to show that you are proficient in using the model output to address the research question, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.

Grading criteria

The model fit is clearly assessed, and interesting findings from the model are clearly described. The model conditions and diagnostics are thoroughly and accurately assessed for the final model, if not previously discussed in the methodology. Interpretations of model coefficients are used to support the key findings and conclusions, rather than merely listing the interpretation of every model coefficient. If the primary modeling objective is prediction, the model’s predictive power is thoroughly assessed.

Discussion

This section is a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. Connect the conclusions drawn from your analysis to previous research, and discuss how they align or differ. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.

Grading criteria

Overall conclusions from analysis are clearly described, and the model results are put into the larger context of the subject matter and original research question. There is thoughtful consideration of potential limitations of the data and/or analysis, and ideas for future work are clearly described.

Organization + formatting

This is an assessment of the overall presentation and formatting of the written report.

Grading criteria

The report is neatly written and organized with clear section headers. Figures have informative labels, are appropriately sized, and are easy to read. Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted and labeled. All citations and links are properly formatted. If there is an appendix, it is reasonably organized and easy for the reader to find relevant information. All code, warnings, and messages are suppressed. The main body of the written report (not including the appendix) is no longer than 10 pages.

AI Disclosure

Did you use an LLM / Generative AI tool to complete the project? If not, copy and paste the first option in your Quarto document. Otherwise, copy and paste all statements that describe how you used it. The purpose of the disclosure is for you to reflect on how you’re using AI in this course. It also helps me learn how students are most effectively using AI.

I didn’t use an LLM / Generative AI tool.
I asked it to clarify criteria.
I asked it clarifying questions to better understand a concept.
I asked it to help write code to complete a part of the project.
I gave it my code and asked it to help me fix it.
I asked it about an error or why code would do something I didn’t want.
Other:______

Repo organization

All written work (with the exception of presentation slides) should be reproducible, and the GitHub repo should be neatly organized.

The GitHub repo should have the following structure:

README: Project title and name(s)
- Optional: Short project summary
written-report.qmd & written-report.pdf: Final written report
proposal.qmd & proposal.pdf: Project proposal
/data: Folder that contains the data set for the final project.
- /data/README.md: Data dictionary and source for data set
project.Rproj: File specifying the RStudio project
/presentation: Folder with the presentation slides or link to slides.
.gitignore: File that lists all files that are in the local RStudio project but not the GitHub repo
/.github: Folder for peer review issue template
Any other files should be neatly organized into clearly labeled folders.

Update the README of the project repo with your project title and name(s).

Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project GitHub repo. The repo should be neatly organized as described above, there should be no extraneous files, and all text in the README should be easily readable.

Overall grading

The grade breakdown is as follows:

Total	100 pts
Project proposal	10 pts
Exploratory data analysis	10 pts
Draft analysis	10 pts
Presentation	15 pts
Presentation comments	2 pts
Draft report + peer review	10 pts
Written report	40 pts
Repo organization	3 pts

Late work policy

No late work is accepted for the draft report or presentation. Other components of the project may be accepted up to 48 hours late. A 10% late deduction will apply for each 24-hour period late.

Be sure to turn in your work early to avoid any technological mishaps.