Estimation Methods

Overview

We provide two estimators that correctly handle bad controls in DiD. Both require the same core assumption — Covariate Unconfoundedness — but differ in how they estimate the nuisance functions.

| Method | Nuisance estimation | Robustness | When to use |
|---|---|---|---|
| Imputation | Parametric (OLS) | Single robustness | Simple settings, linear relationships |
| Doubly Robust ML | OLS outcome + random forest propensity score | Double robustness | When you want robustness to model misspecification |

Both are implemented in the badcontrols R package through bc_att_gt().

The key assumption

Covariate Unconfoundedness (Assumption 5 in the paper):

\[ X_t(0) \perp\!\!\!\perp D \mid X_{t-1}, W, Z \]

where:

  • \(X_t(0)\) is the counterfactual covariate (what \(X_t\) would be without treatment)
  • \(X_{t-1}\) is the pre-treatment covariate
  • \(W\) is an auxiliary variable (typically \(Y_{t-1}\), the lagged outcome)
  • \(Z\) is a vector of time-invariant covariates

This says: conditional on what we observe pre-treatment, the counterfactual \(X_t(0)\) is independent of treatment status. The control group’s \(X_t\) model can be used to predict the treated group’s counterfactual.

The role of W: the auxiliary variable

The auxiliary variable \(W\) deserves special attention. Why isn’t \((X_{t-1}, Z)\) enough?

As the DAG shows, \(W\) is a pre-treatment confounder of both treatment \(D\) and the post-treatment covariate \(X_t\):

\[ D \leftarrow W \rightarrow X_t(0) \]

If you only condition on \((X_{t-1}, Z)\), the confounding path through \(W\) remains open. The imputation model would systematically mispredict \(X_t(0)\) for treated units because the treated and control groups differ in their \(W\) distributions.

Conditioning on \(W\) closes the backdoor path and ensures the imputation is unbiased. In the badcontrols package, \(W\) is included via the xformla argument. The imputation model becomes:

\[ X_t = f(X_{t-1}, W, Z) + \varepsilon \]

Note: When is W not needed?

If Simple Covariate Unconfoundedness holds — \(X_t(0) \perp D \mid X_{t-1}, Z\) — then \(W\) is not required. This is a stronger condition that says \((X_{t-1}, Z)\) alone are sufficient. You can test sensitivity to this by running the estimator with and without \(W\).
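The with/without-\(W\) comparison only requires changing xformla. This sketch reuses the bc_att_gt() arguments from the examples in this document; sim$data and its columns (Y, G, period, id, X, Z, W) are assumed to exist as in those examples.

```r
library(badcontrols)

# Sensitivity check: same specification, with and without the auxiliary W
res_with_w <- bc_att_gt(
  yname = "Y", gname = "G", tname = "period", idname = "id",
  data = sim$data,
  bad_control_formula = ~X,
  xformla = ~Z + W,   # Covariate Unconfoundedness (conditions on W)
  est_method = "imputation"
)

res_without_w <- bc_att_gt(
  yname = "Y", gname = "G", tname = "period", idname = "id",
  data = sim$data,
  bad_control_formula = ~X,
  xformla = ~Z,       # Simple Covariate Unconfoundedness (no W)
  est_method = "imputation"
)

# A large gap between the two suggests W carries real confounding
extract_att(res_with_w)
extract_att(res_without_w)
```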

Tip: What can serve as W?

Any pre-treatment variable that confounds treatment selection and covariate evolution. Examples:

  • Pre-treatment covariates observed in the data (e.g., baseline health, prior earnings)
  • Lagged outcomes \(Y_{t-1}\) (set lagged_outcome_cov = TRUE)
  • Other lagged covariates measured before treatment

The key requirement is that \(W\) is (i) not affected by treatment and (ii) closes the confounding path between \(D\) and \(X_t(0)\).

Method 1: Imputation

Algorithm

Step 1: Impute \(X_t(0)\)

Using the control group (where \(X_t = X_t(0)\)), estimate:

\[ X_t = \alpha + \beta_1 X_{t-1} + \beta_2 W + \beta_3 Z + \varepsilon \]

For treated units, predict \(\hat{X}_t(0)\) using the estimated coefficients and their pre-treatment values.

Step 2: Run DiD with imputed covariate

Replace the observed \(X_t\) with \(\hat{X}_t(0)\) for treated units, then estimate ATT using standard DiD methods conditioning on \(\hat{X}_t(0)\).
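The two steps above can be sketched in base R on simulated data. This is illustrative only, not the package internals: the data-generating process and all coefficients below are invented for the example, with a true ATT of 1.

```r
# Simulate a simple two-period setting with a bad control X
set.seed(1)
n   <- 2000
Z   <- rnorm(n)                        # time-invariant covariate
W   <- rnorm(n)                        # auxiliary pre-treatment confounder
D   <- rbinom(n, 1, plogis(0.5 * W))   # treatment selection depends on W
X0  <- rnorm(n)                        # pre-treatment covariate X_{t-1}
Xt0 <- 0.6 * X0 + 0.4 * W + 0.2 * Z + rnorm(n)  # counterfactual X_t(0)
Xt  <- Xt0 + 0.5 * D                   # treatment shifts X_t: a bad control

# Step 1: fit the X_t model on controls only (where X_t = X_t(0)),
# then predict X_t(0) for treated units from their pre-treatment values
fit_x  <- lm(Xt ~ X0 + W + Z, subset = D == 0)
Xt_hat <- ifelse(D == 1, predict(fit_x, data.frame(X0, W, Z)), Xt)

# Step 2: condition on the imputed X_t(0) instead of the observed X_t
dY  <- 1.0 * D + 0.8 * Xt0 + rnorm(n)  # outcome change, true ATT = 1
att <- coef(lm(dY ~ D + Xt_hat + Z))["D"]
```

Conditioning on the observed Xt instead would absorb part of the treatment effect, since Xt itself responds to D.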

R code

library(badcontrols)

res_imp <- bc_att_gt(
  yname = "Y",
  gname = "G",
  tname = "period",
  idname = "id",
  data = sim$data,
  bad_control_formula = ~X,   # X is the bad control to impute
  xformla = ~Z + W,             # pre-treatment covariates (W is the auxiliary variable)
  est_method = "imputation"
)

extract_att(res_imp)

When it works well

  • Linear relationship between \(X_t\) and \((X_{t-1}, W, Z)\)
  • Moderate sample sizes
  • Fast computation

Method 2: Doubly Robust ML

The idea

The DR estimator builds on the imputation approach but adds a propensity score correction that provides double robustness. It combines three steps:

  1. Impute \(X_t(0)\) via OLS (same as Method 1)
  2. Outcome regression: \(\mathbb{E}[\Delta Y \mid \hat{X}_t(0), Z, D=0]\) via OLS
  3. Generalized propensity score: \(\Pr(D = 1 \mid X_{t-1}, W, Z)\) via random forests (cross-fitted)

The propensity score is estimated using Athey, Tibshirani, and Wager’s (2019) grf package — specifically, probability_forest with \(K\)-fold cross-fitting to avoid overfitting.

The ATT is then estimated using an AIPW (augmented inverse probability weighting) score:

\[ \widehat{ATT}_{DR} = \frac{1}{n_1}\sum_{D_i=1} \hat{\varepsilon}_i - \sum_{D_i=0} \hat{w}_i \, \hat{\varepsilon}_i \]

where \(\hat{\varepsilon}_i = \Delta Y_i - \hat{m}(\hat{X}_{t,i}(0), Z_i)\) are the outcome regression residuals and \(\hat{w}_i\) are Hajek-normalized inverse probability weights.
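The AIPW combination above is easy to compute once the nuisance fits are in hand. A self-contained base-R sketch, with a logistic regression standing in for the cross-fitted random forest propensity score and simulated data (invented coefficients, true ATT of 1):

```r
set.seed(2)
n  <- 5000
Z  <- rnorm(n); W <- rnorm(n); X0 <- rnorm(n)
D  <- rbinom(n, 1, plogis(0.5 * W))
Xt_hat <- 0.6 * X0 + 0.4 * W + 0.2 * Z   # stands in for imputed X_t(0)
dY <- 1.0 * D + 0.8 * Xt_hat + rnorm(n)  # outcome change, true ATT = 1

# Outcome regression m(.) fit on controls; residuals eps_i = dY_i - m(.)
m_fit <- lm(dY ~ Xt_hat + Z, subset = D == 0)
eps   <- dY - predict(m_fit, data.frame(Xt_hat, Z))

# Propensity score (logit here; the package uses cross-fitted forests)
ps <- predict(glm(D ~ X0 + W + Z, family = binomial), type = "response")

# Hajek-normalized IPW weights on the controls: w_i proportional to odds
w <- ifelse(D == 0, ps / (1 - ps), 0)
w <- w / sum(w)

# AIPW score: treated residual mean minus reweighted control residual mean
att_dr <- mean(eps[D == 1]) - sum(w * eps)
```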

Double robustness

The AIPW score is consistent if either the outcome model or the propensity score is correctly specified:

  • If the outcome model is correct, the IPW correction averages to zero (no harm done)
  • If the propensity score is correct, it reweights the controls to match the treated covariate distribution, fixing any misspecification in the outcome model

This provides insurance against model misspecification — you get two chances to get it right.

Cross-fitting

The propensity score uses random forests, which are flexible but can overfit. Cross-fitting prevents this:

  1. Split the sample into \(K\) folds
  2. For each fold, estimate the propensity score on the remaining \(K-1\) folds
  3. Predict on the held-out fold

This yields valid inference even with highly flexible ML estimators.
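The three cross-fitting steps can be sketched in base R. A logistic regression stands in here for grf's probability forest; the fold logic is the same.

```r
set.seed(3)
n <- 1000; K <- 5
W <- rnorm(n); Z <- rnorm(n)
D <- rbinom(n, 1, plogis(0.8 * W))
dat  <- data.frame(D, W, Z)
fold <- sample(rep(1:K, length.out = n))        # 1. split into K folds

ps <- numeric(n)
for (k in 1:K) {
  fit <- glm(D ~ W + Z, family = binomial,
             data = dat[fold != k, ])           # 2. fit on the other K-1 folds
  ps[fold == k] <- predict(fit, dat[fold == k, ],
                           type = "response")   # 3. predict on the held-out fold
}
```

Each unit's propensity score is thus predicted by a model that never saw that unit, which is what prevents overfitting from leaking into the ATT estimate.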

R code

library(badcontrols)

res_ml <- bc_att_gt(
  yname = "Y",
  gname = "G",
  tname = "period",
  idname = "id",
  data = sim$data,
  bad_control_formula = ~X,
  xformla = ~Z + W,
  est_method = "dr_ml"
)

extract_att(res_ml)

When it works well

  • When you want robustness to misspecification of the outcome model
  • When treatment selection depends on covariates in complex, nonlinear ways
  • Moderate to large sample sizes (the random forest PS needs data)

Requirements

The DR/ML method requires the grf package for random forests:

install.packages("grf")

Inference

Both methods produce group-time ATT estimates \(\widehat{ATT}(g,t)\) for each treatment group \(g\) and time period \(t\). These are aggregated into:

  • Overall ATT: weighted average across all \((g,t)\) cells
  • Event study: ATT by event time (periods relative to treatment)
  • Group-specific ATT: ATT for each treatment cohort

Standard errors are computed via the multiplier bootstrap, using the influence function representation of the estimators.

# Overall ATT
extract_att(res)

# The pte infrastructure handles aggregation automatically
summary(res)

Choosing between methods

```mermaid
flowchart TD
    A[Bad control detected] --> B{Relationships linear?}
    B -->|Yes| C[Imputation]
    B -->|No / Unsure| D{Large sample?}
    D -->|Yes| E[DR/ML]
    D -->|No| F[Imputation with flexible formula]
```

In practice, we recommend running both methods and checking that results are similar. If they diverge substantially, investigate the source of the discrepancy.