The Bad Controls Problem

The setup

Consider a standard staggered DiD setting. You observe panel data with:

- \(Y_{it}\): outcome
- \(D_{it}\): treatment indicator
- \(X_{it}\): a time-varying covariate
- \(Z_i\): time-invariant covariates

You want to estimate the ATT, and you believe conditional parallel trends holds:

\[ \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X_t, D = 1] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X_t, D = 0] \]

The problem: \(X_t\) is measured after treatment, so it may itself be shifted by treatment.

The DAG

The causal structure for the post-treatment covariate \(X_t\) looks like this (cf. Figure 1 in the paper):

```mermaid
flowchart TD
    W["W (auxiliary)"] --> D
    W --> Xt["X_t"]
    Xpre["X_{t-1}"] --> D
    Xpre --> Xt
    Z --> D
    Z --> Xt
    D -->|bad!| Xt
```

\(W\) is a pre-treatment variable that confounds both treatment \(D\) and the post-treatment covariate \(X_t\). The pre-treatment covariate \(X_{t-1}\) and time-invariant \(Z\) also affect both \(D\) and \(X_t\).
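To make the DAG concrete, here is a minimal simulated data-generating process that matches its arrows. Every functional form and coefficient below is a made-up illustration, not something taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Pre-treatment variables: auxiliary W, lagged covariate X_{t-1}, time-invariant Z.
W, X_pre, Z = rng.normal(size=(3, n))

# W, X_{t-1}, and Z all feed into treatment assignment (the arrows into D)...
D = (0.8 * W + 0.5 * X_pre + 0.5 * Z + rng.normal(size=n) > 0).astype(float)

# ...and into the post-treatment covariate X_t, which treatment also
# shifts (the "bad!" edge; here the shift is +2 for treated units).
X_t = X_pre + W + Z + 2.0 * D + rng.normal(size=n)

# The treated-control gap in X_t mixes the treatment shift with
# selection on W, X_{t-1}, and Z, so it exceeds the shift of 2.
print(X_t[D == 1].mean() - X_t[D == 0].mean())
```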
The full causal structure including the outcome is:
```mermaid
flowchart LR
    D -->|direct| Y
    D -->|bad!| Xt["X_t"]
    Xt --> Y
```
Treatment \(D\) affects both \(Y\) (directly) and \(X_t\) (the bad control). Because \(X_t\) in turn affects \(Y\), there are two causal channels:
- Direct effect: \(D \to Y\)
- Indirect effect: \(D \to X_t \to Y\) (goes through the bad control)
The total ATT includes both channels. Conditioning on post-treatment \(X_t\) blocks the indirect channel and introduces selection bias.
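To fix ideas, suppose (purely as an illustration; the paper does not assume linearity) that the effects are linear: a direct effect \(\theta\) of \(D\) on \(Y\), an effect \(\delta\) of \(X_t\) on \(Y\), and a treatment-induced shift \(\gamma\) in \(X_t\). Then

\[ \mathrm{ATT} \;=\; \underbrace{\theta}_{D \to Y} \;+\; \underbrace{\delta \gamma}_{D \to X_t \to Y} \]

and conditioning on \(X_t\) removes the \(\delta\gamma\) term (and, in general, adds a selection-bias term on top of that).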
What goes wrong?
Approach 1: Include \(X_t\) directly
When you condition on post-treatment \(X_t\), you are effectively comparing treated and control units with the same observed \(X_t\). But treatment shifts \(X_t\), so you’re comparing:
- Treated units whose \(X_t\) was pulled down by bad luck (despite treatment pushing it up)
- Control units whose \(X_t\) is naturally at that level
This is classic post-treatment selection bias. You recover the direct effect minus a bias term, not the total ATT.
Including \(X_t\): Absorbs the indirect effect \(D \to X \to Y\). Estimates the direct effect (at best), not the total ATT. Generally biased.
Approach 2: Drop \(X_t\) entirely
If parallel trends only holds conditional on \(X_t\), then dropping it means the unconditional parallel trends assumption fails. Your DiD estimate is biased in the other direction.
Dropping \(X_t\): Violates parallel trends if trends differ across \(X\) levels. Omitted variable bias.
Approach 3: Use only \(X_{t-1}\)
Using the pre-treatment value \(X_{t-1}\) avoids the post-treatment selection problem. This works if:
\[ \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X_{t-1}, D = 1] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X_{t-1}, D = 0] \]
But this is a stronger assumption than the original conditional parallel trends. It requires that lagged \(X\) is sufficient — the current value \(X_t(0)\) adds no information. This may or may not hold.
Using \(X_{t-1}\) only: Works under restrictive conditions (parallel trends conditional on lagged \(X\)). May be biased when \(X_t(0)\) matters beyond \(X_{t-1}\).
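The three failure modes can be seen side by side in a toy linear simulation. All names and coefficients below are illustrative assumptions: direct effect 1, effect of \(X_t\) on the outcome change 1, treatment shift of \(X_t\) equal to 2, so the total ATT is 3:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
W, X_pre, Z = rng.normal(size=(3, n))
D = (0.8 * W + 0.5 * X_pre + 0.5 * Z + rng.normal(size=n) > 0).astype(float)
X_t = X_pre + W + Z + 2.0 * D + rng.normal(size=n)   # treatment shifts X_t by 2
dY = 1.0 * D + 1.0 * X_t + rng.normal(size=n)        # total ATT = 1 + 1*2 = 3

def coef_on_D(y, *controls):
    """OLS of y on an intercept, D, and the given controls; return D's coefficient."""
    X = np.column_stack([np.ones(n), D, *controls])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

tau_include = coef_on_D(dY, X_t)    # Approach 1: ~1, the direct effect only
tau_drop    = coef_on_D(dY)         # Approach 2: inflated, confounding bites
tau_lagged  = coef_on_D(dY, X_pre)  # Approach 3: biased, W and Z still confound
print(tau_include, tau_drop, tau_lagged)
```

None of the three estimates lands on the total ATT of 3: including \(X_t\) undershoots (direct effect only), while dropping it or using only \(X_{t-1}\) overshoots because the confounders are left uncontrolled.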
The solution: impute \(X_t(0)\)
The key insight: we can’t use the observed \(X_t\) for treated units (it’s contaminated by treatment). But we can predict what \(X_t\) would have been without treatment — the counterfactual \(X_t(0)\).
The idea:
- Among control units, \(X_t\) is not affected by treatment, so \(X_t = X_t(0)\)
- Learn the relationship \(X_t \sim f(X_{t-1}, W, Z)\) from the control group
- For treated units, predict \(\hat{X}_t(0) = f(X_{t-1}, W, Z)\)
- Run DiD using the imputed \(\hat{X}_t(0)\) instead of the observed \(X_t\)
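The four steps above can be sketched in a few lines. This is a minimal Python illustration with a linear imputation model, a made-up linear DGP (total ATT of 3), and a regression-adjusted DiD standing in for the final step; the paper's implementation uses more flexible learners:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
W, X_pre, Z = rng.normal(size=(3, n))
D = (0.8 * W + 0.5 * X_pre + 0.5 * Z + rng.normal(size=n) > 0).astype(float)
X_t = X_pre + W + Z + 2.0 * D + rng.normal(size=n)   # treatment shifts X_t by 2
dY = 1.0 * D + 1.0 * X_t + rng.normal(size=n)        # total ATT = 1 + 1*2 = 3

# Steps 1-2: among controls X_t = X_t(0), so learn f(X_{t-1}, W, Z) there.
V = np.column_stack([np.ones(n), X_pre, W, Z])
ctrl = D == 0
beta = np.linalg.lstsq(V[ctrl], X_t[ctrl], rcond=None)[0]

# Step 3: impute the counterfactual covariate for every unit.
X_t0_hat = V @ beta

# Step 4: adjust for the imputed X_t(0) instead of the observed X_t.
design = np.column_stack([np.ones(n), D, X_t0_hat])
tau_hat = np.linalg.lstsq(design, dY, rcond=None)[0][1]
print(tau_hat)   # close to 3: the total ATT, direct plus indirect channel
```

Because \(\hat{X}_t(0)\) is built only from pre-treatment variables, it cannot absorb the \(D \to X_t \to Y\) channel, so the coefficient on \(D\) recovers the total effect rather than just the direct one.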
This requires Covariate Unconfoundedness:
\[ X_t(0) \perp\!\!\!\perp D \mid X_{t-1}, W, Z \]
Conditional on pre-treatment variables, the counterfactual covariate distribution is the same for treated and control units. Conditioning on \(W\) is critical because \(W\) confounds both \(D\) and \(X_t(0)\) — without it, the imputation model would inherit the confounding bias.
Imputation: Imputes \(X_t(0)\) for treated units using the control group. Recovers the total ATT under Covariate Unconfoundedness.
Doubly Robust ML: Combines imputation with propensity score weighting. Consistent if either the outcome model or the propensity score is correctly specified. Uses random forests with cross-fitting.
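One generic doubly robust (AIPW-type) form for the ATT, written here as a sketch (the paper's estimator, with random forests and cross-fitting, may differ in its details), is:

\[ \hat{\tau}^{\text{dr}} \;=\; \frac{1}{n_1} \sum_{i:\, D_i = 1} \bigl(\Delta Y_i - \hat{\mu}_0(V_i)\bigr) \;-\; \frac{1}{n_1} \sum_{i:\, D_i = 0} \frac{\hat{e}(V_i)}{1 - \hat{e}(V_i)} \bigl(\Delta Y_i - \hat{\mu}_0(V_i)\bigr) \]

where \(V_i = (X_{t-1,i}, W_i, Z_i)\), \(\hat{\mu}_0\) is a control-group regression of the outcome change on \(V\), \(\hat{e}(V) = \Pr(D = 1 \mid V)\) is the propensity score, and \(n_1\) is the number of treated units. If \(\hat{\mu}_0\) is correct, the control-group term has mean zero whatever \(\hat{e}\) is; if \(\hat{e}\) is correct, the weights rebalance the controls whatever \(\hat{\mu}_0\) is. That is the source of the double robustness.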
See Estimation for details on both methods, and Worked Example for a full R walkthrough.