Causal inference

Author

Prof. Maria Tackett

Published

Apr 02, 2026

Announcements

  • Project:

    • Preliminary analysis due April 7

    • Presentations April 14 & 16

  • Statistics experience due April 15

  • Exam 02: April 9 (in-class), April 9 - 11 (take-home)

    • Lecture recordings + practice questions available on website

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(knitr)
library(MatchIt) # for propensity score matching
library(ggridges)
library(patchwork)


# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

# set color palettes 
sunset2 <- PNWColors::pnw_palette("Sunset2",2)
sunset3 <- PNWColors::pnw_palette("Sunset2",3)
sunset4 <- PNWColors::pnw_palette("Sunset2",4)

Topics

  • Discuss data science ethics case study

  • Introduce causal inference for observational data

  • Use propensity scores to create a matched data set

  • Draw causal claims using the matched data

Data science ethics case study

A data scientist received permission to analyze a data set that was scraped from a social media site. The full data set included name, screen name, email address, geographic region, IP (internet protocol) address, demographic profiles, and preferences for relationships.


What are the ethical considerations of putting a deidentified data set (with name and email address removed) into an LLM (e.g., Claude or ChatGPT) to help with analysis?

Adapted from Chapter 8 of Baumer, Kaplan, and Horton (2024)

Causal inference

Impact of Project ACE

Project Action for Equity (Project ACE) was a “five year interdisciplinary program aimed to get more underrepresented high school students from disadvantaged backgrounds to get interested in college degrees in engineering as well as biomedical and behavioral sciences” (The University of Texas at El Paso College of Liberal Arts n.d.).


Our goal is to evaluate whether the data provide evidence that participating in Project ACE had a positive impact on students’ GPA.


The data were obtained from Evans, Perez, and Morera (2025), and the analysis in the lecture will closely follow the analysis in the original article.

Variables

The data are in project-ace-data.csv. We will use the following variables:

  • Grade: Grade level (9, 10, 11, 12)
  • Gender: Gender (F, M)
  • Ethnicity.Hispanic.Y.N: Ethnicity (Hispanic, Non Hispanic)
  • ELL: Whether student is an English language learner (N, Y)
  • Sped: Whether student is in a special education program (N, Y)
  • Homeless: Whether student is homeless (N, Y)
  • Tracking.Pathway: Whether student was in Project ACE (Treatment) or not (Control)
  • Current.GPA: Grade Point Average (GPA), ranging from 0 to 5.0

Glimpse of data

Rows: 1,300
Columns: 13
$ Student.ID                                        <dbl> 3955, 7008, 7791, 59…
$ Grade                                             <fct> 12, 11, 11, 10, 11, …
$ Gender                                            <chr> "F", "M", "M", "M", …
$ Race                                              <chr> "C", "C", "C", "C", …
$ Ethnicity.Hispanic.Y.N                            <chr> "Hispanic", "Hispani…
$ ELL                                               <chr> "Y", "Y", "Y", "Y", …
$ Sped                                              <chr> "Y", "Y", "N", "Y", …
$ Homeless                                          <chr> "N", "N", "N", "N", …
$ Free.Lunch                                        <chr> "Y", "Y", "Y", "Y", …
$ Migrant                                           <chr> "N", "N", "N", "N", …
$ Current.GPA                                       <dbl> 3.455, 1.688, 2.233,…
$ Number.of.Classes.Enrolled.in.for.the.school.year <dbl> 7, 7, 7, 7, 7, 7, 7,…
$ Tracking.Pathway                                  <fct> Control, Control, Co…

Distribution of tracking pathway

Distribution of GPA

How does the distribution of GPA compare between the two groups?

Covariates by tracking pathway

Try a model

term                                 estimate  std.error  statistic  p.value
(Intercept)                             2.391      0.054     44.383    0.000
Grade10                                 0.251      0.061      4.106    0.000
Grade11                                 0.310      0.062      5.024    0.000
Grade12                                 0.409      0.064      6.400    0.000
GenderM                                -0.267      0.046     -5.844    0.000
Ethnicity.Hispanic.Y.NNon Hispanic      0.049      0.184      0.268    0.789
ELLY                                   -0.307      0.048     -6.401    0.000
SpedY                                  -0.258      0.061     -4.199    0.000
HomelessY                              -0.163      0.222     -0.737    0.461
Tracking.PathwayTreatment               0.140      0.071      1.967    0.049

Draw causal conclusion?



Why should we avoid using this model to conclude participation in Project ACE improved students’ GPAs?

Two types of data

  • Experimental data: Data obtained by explicitly applying a treatment

    • Individuals randomly assigned to treatment or control groups
    • Can draw causal conclusions, because we can assume the only difference between treatment and control groups is the treatment itself
  • Observational data: Data obtained without explicitly applying a treatment

    • There may be underlying factors (confounders) that affect both the response and the likelihood of being in the treatment group

    • Cannot draw causal conclusions from typical regression analysis
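
The contrast between the two types of data can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the Project ACE data: the variable `motivation`, the sample size, and all coefficients are assumptions chosen for illustration. The true treatment effect is set to 0, so any difference between groups reflects confounding or random noise rather than the treatment.

```r
set.seed(210)
n <- 10000

# Hypothetical confounder: affects both the outcome and the chance of treatment
motivation <- rnorm(n)

# Outcome depends only on the confounder, so the true treatment effect is 0
gpa <- 2.5 + 0.5 * motivation + rnorm(n, sd = 0.5)

# Experimental setting: treatment assigned at random, unrelated to motivation
treat_random <- rbinom(n, 1, 0.3)

# Observational setting: more motivated students are more likely to opt in
treat_selected <- rbinom(n, 1, plogis(-1 + 1.5 * motivation))

# Difference in mean outcome, treatment minus control
mean_diff <- function(y, treat) mean(y[treat == 1]) - mean(y[treat == 0])

diff_random <- mean_diff(gpa, treat_random)
diff_selected <- mean_diff(gpa, treat_selected)
round(c(random = diff_random, selected = diff_selected), 2)
```

Under random assignment the difference should be close to 0, while under self-selection the treatment group appears to have a higher GPA even though the treatment does nothing in this simulation.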

Causal conclusions from observational data

  • We will use statistical methods to make observational data look more like experimental data

  • We do so by making the treatment and control groups similar based on a set of covariates (variables) that impact both the response and the likelihood an individual is in the treatment group

  • We create these groups by matching individuals in the treatment and control groups based on these underlying covariates

    • We do so using propensity score matching

Propensity score matching

  • Propensity score: The probability an observation is assigned to the treatment group based on a set of confounding variables that directly impact the response and likelihood of being in the treatment group

  • Propensity score matching: Create a new data set by matching individuals in the treatment and control groups who have the same (or similar) propensity scores

  • Use the matched data set to model the effect of the treatment

Propensity score matching

Source: Figure 1 in Evans, Perez, and Morera (2025)

Propensity score model

propensity_score_model <- glm(Tracking.Pathway ~ Grade + Gender + Ethnicity.Hispanic.Y.N + ELL + Sped + Homeless,
                        data = project_ace, 
                        family = "binomial")
term                                 estimate  std.error  statistic  p.value
(Intercept)                            -1.670      0.216     -7.733    0.000
Grade10                                 0.539      0.264      2.040    0.041
Grade11                                 0.566      0.263      2.155    0.031
Grade12                                 0.314      0.273      1.151    0.250
GenderM                                -1.390      0.206     -6.731    0.000
Ethnicity.Hispanic.Y.NNon Hispanic     -0.498      0.775     -0.643    0.520
ELLY                                   -0.792      0.222     -3.562    0.000
SpedY                                   0.017      0.277      0.060    0.952
HomelessY                               1.126      0.649      1.734    0.083

Propensity scores

The propensity scores are the predicted probabilities from the logistic regression model

# compute propensity scores
project_ace_aug <- augment(propensity_score_model, 
                           type.predict = "response")

project_ace_aug |> slice(1:10)
# A tibble: 10 × 13
   Tracking.Pathway Grade Gender Ethnicity.Hispanic.Y.N ELL   Sped  Homeless
   <fct>            <fct> <chr>  <chr>                  <chr> <chr> <chr>   
 1 Control          12    F      Hispanic               Y     Y     N       
 2 Control          11    M      Hispanic               Y     Y     N       
 3 Control          11    M      Hispanic               Y     N     N       
 4 Control          10    M      Hispanic               Y     Y     N       
 5 Control          11    M      Hispanic               N     N     N       
 6 Control          12    M      Hispanic               N     Y     N       
 7 Control          12    M      Hispanic               N     N     N       
 8 Treatment        12    M      Hispanic               N     N     N       
 9 Control          12    M      Hispanic               Y     N     N       
10 Control          12    F      Hispanic               N     N     Y       
# ℹ 6 more variables: .fitted <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>,
#   .cooksd <dbl>, .std.resid <dbl>
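
To see where a single propensity score comes from, the predicted probability can be worked out by hand from the coefficient table of the propensity score model. The sketch below assumes a hypothetical student in Grade 11 who is at the baseline level of every other predictor (female, Hispanic, not an English language learner, not in special education, not homeless). It uses the rounded coefficients from the table, so the result is approximate.

```r
# Coefficients copied from the propensity score model table (rounded to 3 decimals)
intercept <- -1.670
grade11 <- 0.566

# Log-odds of being in the treatment group for the hypothetical student
log_odds <- intercept + grade11

# Convert log-odds to a probability: exp(x) / (1 + exp(x))
propensity <- plogis(log_odds)
round(propensity, 3)
```

The result is roughly 0.249, meaning a student with this profile has about a 25% estimated probability of being in Project ACE.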

Common support

Common support: Individuals in the matched treatment and control groups have a non-zero probability of being in the treatment group

  • What do you observe from the visualization?

  • Why is common support important?
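
One simple numeric check that can accompany the visualization is to find the range of propensity scores observed in both groups and the share of each group that falls inside it. This is a rough sketch on simulated scores, not the Project ACE data: the group sizes and the beta distributions are assumptions standing in for the `.fitted` column.

```r
set.seed(2026)

# Simulated propensity scores (hypothetical stand-ins for the .fitted column)
ps_control <- rbeta(500, 2, 8)
ps_treatment <- rbeta(100, 3, 6)

# Region where both groups have observed propensity scores
overlap <- c(lower = max(min(ps_control), min(ps_treatment)),
             upper = min(max(ps_control), max(ps_treatment)))

# Share of a group whose scores fall inside the overlap region
in_overlap <- function(ps) {
  mean(ps >= overlap[["lower"]] & ps <= overlap[["upper"]])
}

shares <- c(control = in_overlap(ps_control),
            treatment = in_overlap(ps_treatment))
overlap
shares
```

Observations outside the overlap region have no counterpart with a similar score in the other group, so they are difficult to match well.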

Propensity score matching in R

We conduct propensity score matching using matchit() from the MatchIt R package (you will need to install it in the RStudio Docker containers).

library(MatchIt)

# generate propensity scores and matches
project_ace_psm <- matchit(Tracking.Pathway ~ Grade + Gender + Ethnicity.Hispanic.Y.N +
                          Race + ELL + Sped + Homeless,
                 data = project_ace, method = "nearest", distance = "logit")

# matched data set
project_ace_matched <- match.data(project_ace_psm)


  • method = "nearest": Each observation in the treatment group is matched to the observation in the control group with the closest propensity score

  • distance = "logit": Generate the propensity scores using a logistic regression model
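
The idea behind nearest-neighbor matching can be sketched in a few lines of base R. This is a simplified illustration, not MatchIt's implementation: the function `nearest_match()` and the toy propensity scores are hypothetical, and details such as the order in which treated observations are processed may differ from what matchit() does.

```r
# Toy propensity scores (hypothetical values, not from the Project ACE data)
treated_ps <- c(t1 = 0.45, t2 = 0.25, t3 = 0.16)
control_ps <- c(c1 = 0.10, c2 = 0.17, c3 = 0.24, c4 = 0.38, c5 = 0.60)

# Greedy 1:1 nearest-neighbor matching without replacement
nearest_match <- function(treated_ps, control_ps) {
  available <- rep(TRUE, length(control_ps))
  matches <- character(length(treated_ps))
  names(matches) <- names(treated_ps)
  for (i in seq_along(treated_ps)) {
    gaps <- abs(control_ps - treated_ps[i])
    gaps[!available] <- Inf        # each control can be used only once
    j <- which.min(gaps)           # closest remaining control
    matches[i] <- names(control_ps)[j]
    available[j] <- FALSE
  }
  matches
}

matches <- nearest_match(treated_ps, control_ps)
print(matches)
```

Here `t1` is paired with `c4`, `t2` with `c3`, and `t3` with `c2`. The controls `c1` and `c5` are left unmatched and would be dropped from the matched data set.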

Matched data


How many observations are in the matched data set based on our matching process?



We’re using \(1:1\) matching, but there are other approaches, such as \(1:k\) matching and weighting, that reduce data loss.

Matched data

Rows: 296
Columns: 16
$ Student.ID                                        <dbl> 9478, 8268, 9846, 22…
$ Grade                                             <fct> 12, 12, 12, 12, 11, …
$ Gender                                            <chr> "M", "M", "F", "F", …
$ Race                                              <chr> "C", "C", "C", "C", …
$ Ethnicity.Hispanic.Y.N                            <chr> "Hispanic", "Hispani…
$ ELL                                               <chr> "N", "N", "N", "N", …
$ Sped                                              <chr> "N", "N", "N", "N", …
$ Homeless                                          <chr> "N", "N", "Y", "N", …
$ Free.Lunch                                        <chr> "Y", "Y", "Y", "Y", …
$ Migrant                                           <chr> "N", "N", "N", "N", …
$ Current.GPA                                       <dbl> 1.452, 2.957, 2.955,…
$ Number.of.Classes.Enrolled.in.for.the.school.year <dbl> 7, 7, 7, 7, 7, 7, 7,…
$ Tracking.Pathway                                  <fct> Control, Treatment, …
$ distance                                          <dbl> 0.05897680, 0.058976…
$ weights                                           <dbl> 1, 1, 1, 1, 1, 1, 1,…
$ subclass                                          <fct> 1, 1, 42, 3, 2, 2, 4…

Example matches

subclass  Tracking.Pathway  distance   Grade  Gender  Ethnicity.Hispanic.Y.N  ELL  Sped  Homeless
14        Control           0.3786138  9      F       Hispanic                N    N     Y
14        Treatment         0.4507080  12     F       Hispanic                N    N     Y
50        Control           0.2496294  11     F       Hispanic                N    N     N
50        Treatment         0.2496294  11     F       Hispanic                N    N     N
118       Control           0.1585438  9      F       Hispanic                N    N     N
118       Treatment         0.1585438  9      F       Hispanic                N    N     N

Covariates by group in matched data

Fit the treatment model

Use matched data set to fit the treatment model.

treatment_model <- lm(Current.GPA ~ Tracking.Pathway, 
                      data = project_ace_matched)
term                       estimate  std.error  statistic  p.value
(Intercept)                   2.418      0.069     35.043    0.000
Tracking.PathwayTreatment     0.209      0.098      2.138    0.033


Describe the effect of Project ACE on GPA. Is there evidence that participating in the program has a positive impact on GPA?

Average treatment effect

  • The model produces the average treatment effect on the matched population
  • The results may not generalize to the entire population if the observations removed from the analysis are systematically different from the observations in the matched data
  • Strategies such as \(1:k\) matching and weighting can mitigate this limitation
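
The estimate and standard error from the treatment model output can be turned into an approximate 95% confidence interval for the effect in the matched data. This is a sketch under stated assumptions: it uses the rounded values from the output, takes the degrees of freedom to be 296 matched rows minus 2 coefficients, and treats the matched observations as independent, as the lm() output does.

```r
# Values copied from the treatment model output (rounded to 3 decimals)
estimate <- 0.209
std_error <- 0.098

# Degrees of freedom: 296 rows in the matched data minus 2 model coefficients
df <- 296 - 2

# Critical value for a 95% interval from the t distribution
t_crit <- qt(0.975, df)

ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci, 3)
```

The interval runs from roughly 0.02 to 0.40 GPA points. It excludes 0, which is consistent with the p-value of 0.033 in the output.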

Recap

  • Introduced causal inference for observational data

  • Used propensity scores to create a matched data set

  • Drew causal claims using the matched data

Further reading

All books are freely available online.

Exam 02 review

We will do Exam 02 review in lab on Monday, April 6 and in lecture on Tuesday, April 7. Exam 02 covers content from multicollinearity (February 24) through today.

Please submit one question you have about the Exam 02 content. I will use these questions to write the exam reviews.

🔗 https://forms.cloud.microsoft/r/0ukue0JaMw

Next class

  • Exam 02 review

  • No prepare assignment

References

Baumer, Benjamin S, Daniel T Kaplan, and Nicholas J Horton. 2024. Modern Data Science with R. 3rd ed. https://mdsr-book.github.io/mdsr3e/.
Evans, Nicholas D, Perla C Perez, and Osvaldo F Morera. 2025. “Testing the Efficacy of Educational Interventions on Matched Student Samples: A Primer for Propensity Score Matching in R.” Journal of STEM Outreach 8 (1): 1–9.
The University of Texas at El Paso College of Liberal Arts. n.d. “Project ACE: Action for Equity.” https://www.utep.edu/liberalarts/project-ace/.