Quick Start to Doubly Robust Estimators
Here, I will provide a simple explanation of doubly robust estimators using R’s tidyverse. This post is heavily inspired by Matheus Facure’s Causal Inference for the Brave and True, where you can find comprehensive tutorials on many causal inference models in Python.
In a first introduction to Causal Inference, we learn about linear estimators and propensity score weighting methods to estimate the average treatment effect conditional on covariates.
However, covariates are the main source of confounding in causal inference settings, so we should ask: which method should I use? Which one is best? Well, actually, you can combine both and get a consistent estimator as long as at least one of them is right!
Let us imagine we are investigating the causal relationship between income and a government training program that individuals may or may not participate in.
First, we create a very simple static “labor market”. Assume 10,000 workers, for each of whom we observe two covariates, X1 and X2.
library(tidyverse)
library(fixest)
set.seed(123)
n <- 10000
# Generate covariates
X1 <- rnorm(n)
X2 <- rnorm(n)
For the sake of simplicity, X1 and X2 are draws from standard normal distributions. We can assume there is some function that maps these draws to real-world variables; in this case, say, education and work experience.
Let us now create a treatment variable. In many real-world settings without a randomized experiment, assignment to treatment is correlated with certain variables through a functional form we may not know.
In this labor market example, individuals may have the opportunity to enter the program, but some are inclined not to participate because they have already accumulated enough human capital.
# Generate treatment assignment through a logistic propensity score
ps_true <- plogis(-1 - 0.5*X1 - 0.5*X2)
treat <- rbinom(n, 1, ps_true)
Here, ps_true is a logistic function that maps X1 and X2 to the probability of treatment. Assignment to treatment, therefore, is generated by a binomial draw that takes these propensity scores and produces a success (participated in the training program) or a failure (refused to participate). Note that the relationship is negative: higher values of human capital, education or work experience, increase the likelihood of non-participation.
Well, can we simply compare group means in this case? Let us finish constructing the data by creating the income variable Y and calculating the classical ATE.
# Treatment effect
tau <- 0.5
# Generate the data
df <- tibble(
id = 1:n,
X1 = X1,
X2 = X2,
treat = treat,
Y = tau*treat + 0.25*X1 + 0.25*X2 + rnorm(n, 0, 0.5)
)
# Naive ATE calculation
naive_ATE <- df %>%
group_by(treat) %>%
summarize(meanY = mean(Y)) %>%
summarize(ATE = diff(meanY)) %>%
pull()
r$> naive_ATE
[1] 0.2863138
The true treatment effect of the training program is $\tau = 0.5$. However, when we calculate the ATE by simply differencing the average outcomes of the treatment and control groups, we are off by almost half of the true value.
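Where does this bias come from? We can look at covariate balance directly (a quick check on the data we just built):
# Covariate balance: treated workers have lower human capital on average
df %>%
  group_by(treat) %>%
  summarize(mean_X1 = mean(X1), mean_X2 = mean(X2))
Since both covariates raise income but lower the chance of treatment, the treated group’s average outcome is dragged below what it would be under random assignment.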
The same biased estimate comes out of an incomplete linear model, where the covariates are not included:
# Naive linear estimator
wrong_linear <- feols(Y ~ treat, data = df)[["coefficients"]][2]
r$> wrong_linear
[1] 0.2863138
Terrible estimates. With stronger confounding, the sign of the naive estimate can even flip! Here is a quick sketch of that (a hypothetical variant of our data in which the covariate effects on income are much larger):
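# Hypothetical variant: same assignment mechanism, stronger confounding
Y_flip <- tau*treat + 1.5*X1 + 1.5*X2 + rnorm(n, 0, 0.5)
# The naive difference in means should now come out negative,
# even though the true effect is still tau = +0.5
mean(Y_flip[treat == 1]) - mean(Y_flip[treat == 0])
So we should be extra careful when estimating our parameters. Let us put the covariates back into the linear model and see if we can correctly estimate $\tau$: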
# Correct linear estimator
correct_lm <- feols(Y ~ treat + X1 + X2, data = df)
summary(correct_lm)
OLS estimation, Dep. Var.: Y
Observations: 10,000
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.000557 0.006062 0.091834 0.92683
treat 0.511810 0.011573 44.224591 < 2.2e-16 ***
X1 0.249916 0.005092 49.077613 < 2.2e-16 ***
X2 0.247565 0.005191 47.687099 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.502913 Adj. R2: 0.341397
We found 0.51, and the confidence interval covers the true value. Not bad. So as long as we know the true specification of the linear model, we can clean out the bias created by the confounding covariates.
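To check the “covers the true value” claim explicitly, we can pull the confidence interval from the fitted model (confint works on fixest objects):
# 95% confidence interval for the treatment coefficient
confint(correct_lm)["treat", ]
From the output above, the interval is roughly $0.512 \pm 1.96 \times 0.012$, which indeed contains 0.5.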
That is cool. But what if we want to use a propensity score weighting method to calculate the ATE? Let’s see:
# correct propensity scores
ps_model <- feglm(treat ~ X1 + X2, data = df, family = binomial)
# Plug the PS back in the data
df <- df %>%
mutate(
ps = predict(ps_model, type = "response"),
weight = ifelse(treat == 1, 1/ps, 1/(1 - ps))
)
# Correct PS estimator
Y1 <- sum(df$Y[df$treat == 1] * df$weight[df$treat == 1]) / nrow(df)
Y0 <- sum(df$Y[df$treat == 0] * df$weight[df$treat == 0]) / nrow(df)
correct_PS <- Y1 - Y0
r$> correct_PS
[1] 0.5085938
In this block of code, we use the feglm function to fit a logistic model for the probability of treatment given X1 and X2. We then construct the propensity score weights by taking the inverse of these fitted probabilities, according to each unit’s treatment status.
Why invert it? The logic is quite simple: we impose heavier weights on rarity. A treated observation that looks very similar to a control is more valuable than otherwise: with a propensity score of 0.1, for instance, it receives a weight of 1/0.1 = 10. The same applies to the untreated group: if their propensity score is high, even though they were untreated, we give them more importance.
The estimate of 0.508 is not bad, a very good approximation! We could even construct a bootstrap standard error for this estimate by resampling our data and recalculating it over and over, along the lines of the sketch below (a minimal version that refits the propensity model in each resample):
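# Bootstrap the IPW estimate: resample rows, refit the PS model, recompute
ipw_boot <- map_dbl(1:500, function(b) {
  boot_df <- slice_sample(df, n = nrow(df), replace = TRUE)
  ps_b <- predict(feglm(treat ~ X1 + X2, data = boot_df, family = binomial),
                  type = "response")
  w_b <- ifelse(boot_df$treat == 1, 1/ps_b, 1/(1 - ps_b))
  sum(boot_df$Y * w_b * (boot_df$treat == 1)) / nrow(boot_df) -
    sum(boot_df$Y * w_b * (boot_df$treat == 0)) / nrow(boot_df)
})
sd(ipw_boot)  # bootstrap standard error of the IPW estimate
But let us focus on the real star of the show: the doubly robust estimator.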
The main idea is that we often don’t know the true relationships in our models. How do the covariates affect treatment assignment? What is the correct specification of the linear model? The beauty of doubly robust estimation is that even if we get either the propensity score model or the outcome model wrong, we still get consistent estimates, as long as the other one is correct!
# DR with misspecified linear model
dr_lm_misspecified <- feols(Y ~ treat, data = df, weights = df$weight)
summary(dr_lm_misspecified)
OLS estimation, Dep. Var.: Y
Observations: 10,000
Weights: df$weight
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.001506 0.008672 0.173704 0.8621
treat 0.510292 0.012278 41.562297 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.867579 Adj. R2: 0.147238
Note that even with the covariates omitted, when we use the propensity score weights in the linear model, we get very close to the true parameter! Much better than 0.28.
That is cool, but why is that? The secret is in the doubly robust estimator:
\[\widehat{ATE} = \frac{1}{N} \sum \left( \frac{D_i(Y_i - \hat{\mu}_1(X_i))}{\hat{P}(X_i)} + \hat{\mu}_1(X_i) \right) - \frac{1}{N} \sum \left( \frac{(1-D_i)(Y_i - \hat{\mu}_0(X_i))}{1-\hat{P}(X_i)} + \hat{\mu}_0(X_i) \right)\]
Here $D_i$ is the treatment indicator, $\hat{\mu}_1(X_i)$ and $\hat{\mu}_0(X_i)$ are outcome regressions fitted on the treated and control groups, and $\hat{P}(X_i)$ is the estimated propensity score. If both models are correctly specified, we have nothing to worry about. But what if the linear model is correct and the propensity model is incorrect?
Note that both parts contain a $Y_i - \hat{\mu}_d(X_i)$ term. If the linear model is correct, $E[Y_i - \hat{\mu}_d(X_i) \mid X_i] = 0$, since this is just the residual of the linear regression! The entire component involving the propensity score then vanishes in expectation, and we are left with the regression predictions.
If instead the propensity score is correct, you can rearrange each part to isolate $\hat{\mu}_d(X_i)$. For the treated part, $\frac{D_i(Y_i - \hat{\mu}_1(X_i))}{\hat{P}(X_i)} + \hat{\mu}_1(X_i) = \frac{D_i Y_i}{\hat{P}(X_i)} - \hat{\mu}_1(X_i)\frac{D_i - \hat{P}(X_i)}{\hat{P}(X_i)}$, and since $E[D_i - \hat{P}(X_i) \mid X_i] = 0$ when the propensity model is right, the $\hat{\mu}_1$ term drops out and we are back to plain inverse probability weighting. All good!
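To make this concrete, here is a minimal sketch of the estimator above in R, reusing the propensity scores already stored in df and fitting the outcome regressions $\hat{\mu}_1$ and $\hat{\mu}_0$ by OLS on each group (the helper name aipw_ate is my own):
# Doubly robust (AIPW) estimator, coded directly from the formula
aipw_ate <- function(df) {
  # Outcome models fitted separately on treated and control units
  mu1_fit <- feols(Y ~ X1 + X2, data = filter(df, treat == 1))
  mu0_fit <- feols(Y ~ X1 + X2, data = filter(df, treat == 0))
  mu1 <- predict(mu1_fit, newdata = df)
  mu0 <- predict(mu0_fit, newdata = df)
  with(df,
       mean(treat*(Y - mu1)/ps + mu1) -
         mean((1 - treat)*(Y - mu0)/(1 - ps) + mu0))
}
aipw_ate(df)
To convince yourself of the double robustness, try deliberately breaking one input at a time, say, refitting the propensity model with only X1, or dropping X2 from the outcome regressions, and check that the estimate stays close to 0.5.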
So what can you do with this? And how do we actually find the correct specification for the propensity scores? There are many researchers looking for this answer. Recently, the great Pedro Sant’Anna and Jun Zhao provided a pathway to doubly robust difference-in-differences estimators. And, if you really want to go deeper, Edward H. Kennedy provides a review of machine learning tools to estimate doubly robust parameters.