The Bad Controls Problem

The setup

Consider a standard staggered DiD setting. You observe panel data with:

  • \(Y_{it}\): outcome
  • \(D_{it}\): treatment indicator
  • \(X_{it}\): a time-varying covariate
  • \(Z_i\): time-invariant covariates

You want to estimate the ATT, and you believe conditional parallel trends holds:

\[ \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X_t, D = 1] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X_t, D = 0] \]

The problem: \(X_t\) is measured after treatment, so it may itself be shifted by treatment.
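To fix notation, here is a minimal two-period simulation consistent with this setup. Everything here is invented for illustration (coefficients, functional forms, variable names); it is not the paper's data-generating process.

```python
# Hypothetical DGP: W and Z confound treatment and the covariate, and the
# bad-control arrow D -> X_t shifts the post-treatment covariate.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
W = rng.normal(size=n)                       # pre-treatment confounder
Z = rng.normal(size=n)                       # time-invariant covariate
X_prev = rng.normal(size=n)                  # X_{t-1}, pre-treatment

# Treatment is assigned between t-1 and t, based on pre-treatment variables
D = rng.binomial(1, 1 / (1 + np.exp(-(W + X_prev + 0.5 * Z))))

# X_t is shifted by treatment: the "bad control"
X_t = 0.8 * X_prev + 0.5 * W + 0.4 * Z + 1.0 * D + rng.normal(size=n)

# Outcomes in each period; D enters Y both directly and through X_t
Y_prev = W + Z + 0.5 * X_prev + rng.normal(size=n)
Y_t = 2.0 * D + 0.5 * X_t + W + Z + rng.normal(size=n)

# Treated units have higher X_t: the direct +1 shift plus selection on W, X_prev
print(X_t[D == 1].mean() - X_t[D == 0].mean())  # > 1
```

The gap in mean \(X_t\) across arms exceeds the pure treatment shift because treated units also have higher \(W\) and \(X_{t-1}\), which is exactly why comparing units at the same observed \(X_t\) is misleading.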

The DAG

The causal structure for the post-treatment covariate \(X_t\) looks like this (cf. Figure 1 in the paper):

```mermaid
flowchart TD
    W["W (auxiliary)"] --> D
    W --> Xt["X_t"]
    Xpre["X_{t-1}"] --> D
    Xpre --> Xt
    Z --> D
    Z --> Xt
    D -->|bad!| Xt
```

\(W\) is a pre-treatment variable that confounds both treatment \(D\) and the post-treatment covariate \(X_t\). The pre-treatment covariate \(X_{t-1}\) and time-invariant \(Z\) also affect both \(D\) and \(X_t\).

The full causal structure including the outcome is:

```mermaid
flowchart LR
    D -->|direct| Y
    D -->|bad!| Xt["X_t"]
    Xt --> Y
```

Treatment \(D\) affects both \(Y\) (directly) and \(X_t\) (the bad control). Because \(X_t\) also affects \(Y\), there are two causal channels:

  1. Direct effect: \(D \to Y\)
  2. Indirect effect: \(D \to X_t \to Y\) (goes through the bad control)

The total ATT includes both channels. Conditioning on post-treatment \(X_t\) blocks the indirect channel and introduces selection bias.

What goes wrong?

Approach 1: Include \(X_t\) directly

When you condition on post-treatment \(X_t\), you are effectively comparing treated and control units with the same observed \(X_t\). But treatment shifts \(X_t\), so you’re comparing:

  • Treated units whose \(X_t\) was pulled down by bad luck (despite treatment pushing it up)
  • Control units whose \(X_t\) is naturally at that level

This is classic post-treatment selection bias. At best you recover the direct effect, contaminated by a selection-bias term, not the total ATT.

Including \(X_t\): Absorbs the indirect effect \(D \to X \to Y\). Estimates the direct effect (at best), not the total ATT. Generally biased.

Approach 2: Drop \(X_t\) entirely

If parallel trends only holds conditional on \(X_t\), then dropping it means the unconditional parallel trends assumption fails. Your DiD estimate is again biased, now by differential trends across \(X\) levels rather than by post-treatment selection.

Dropping \(X_t\): Violates parallel trends if trends differ across \(X\) levels. Omitted variable bias.

Approach 3: Use only \(X_{t-1}\)

Using the pre-treatment value \(X_{t-1}\) avoids the post-treatment selection problem. This works if:

\[ \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X_{t-1}, D = 1] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X_{t-1}, D = 0] \]

But this is a stronger assumption than the original conditional parallel trends. It requires that lagged \(X\) is sufficient — the current value \(X_t(0)\) adds no information. This may or may not hold.

Using \(X_{t-1}\) only: Works under restrictive conditions (parallel trends conditional on lagged \(X\)). May be biased when \(X_t(0)\) matters beyond \(X_{t-1}\).
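These failure modes can be seen in a small simulation. What follows is a cross-sectional regression analogue of the DiD logic, with invented coefficients: the true total ATT is 2.5, of which 2.0 is direct and 0.5 flows through the bad control.

```python
# Illustrative DGP (not from the paper): D shifts X_t by 1.0, the direct
# effect of D on Y is 2.0, and X_t enters Y with slope 0.5, so the total
# ATT is 2.0 + 0.5 * 1.0 = 2.5.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
W, Z, X_pre = rng.normal(size=(3, n))
D = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * W + 0.5 * X_pre + 0.3 * Z))))
X_t = 0.8 * X_pre + 0.5 * W + 0.4 * Z + 1.0 * D + rng.normal(size=n)
Y = 2.0 * D + 0.5 * X_t + W + Z + rng.normal(size=n)

def d_coef(y, *controls):
    """OLS coefficient on D, controlling for the given covariates."""
    M = np.column_stack([np.ones(n), D, *controls])
    return np.linalg.lstsq(M, y, rcond=None)[0][1]

print(round(d_coef(Y, X_t, X_pre, W, Z), 2))  # ~2.0: X_t absorbs the indirect channel
print(round(d_coef(Y), 2))                    # > 2.5: confounded, no controls at all
print(round(d_coef(Y, X_pre, W, Z), 2))       # ~2.5: pre-treatment controls only
```

In this deliberately friendly DGP, pre-treatment controls alone recover the total ATT, matching the "works under restrictive conditions" caveat above; the point is that including \(X_t\) pulls the estimate down to the direct effect even when everything else is well specified.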

The solution: impute \(X_t(0)\)

The key insight: we can’t use the observed \(X_t\) for treated units (it’s contaminated by treatment). But we can predict what \(X_t\) would have been without treatment — the counterfactual \(X_t(0)\).

The idea:

  1. Among control units, \(X_t\) is not affected by treatment, so \(X_t = X_t(0)\)
  2. Learn the relationship \(X_t \sim f(X_{t-1}, W, Z)\) from the control group
  3. For treated units, predict \(\hat{X}_t(0) = f(X_{t-1}, W, Z)\)
  4. Run DiD using the imputed \(\hat{X}_t(0)\) instead of the observed \(X_t\)
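The four steps can be sketched on simulated data, with a linear model standing in for \(f\) and a regression adjustment standing in for the DiD step. All names and coefficients are illustrative; the paper's estimator is more general.

```python
# Illustrative DGP: total ATT = 2.0 (direct) + 0.5 * 1.0 (via X_t) = 2.5.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
W, Z, X_pre = rng.normal(size=(3, n))
D = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * W + 0.5 * X_pre + 0.3 * Z))))
X_t0 = 0.8 * X_pre + 0.5 * W + 0.4 * Z + rng.normal(size=n)  # counterfactual X_t(0)
X_t = X_t0 + 1.0 * D                                         # observed, shifted by D
Y = 2.0 * D + 0.5 * X_t + W + Z + rng.normal(size=n)

# Step 2: learn f from the control group, where X_t = X_t(0)
A = np.column_stack([np.ones(n), X_pre, W, Z])
beta = np.linalg.lstsq(A[D == 0], X_t[D == 0], rcond=None)[0]

# Step 3: impute X_t(0) for treated units; controls keep their observed X_t
X_hat = np.where(D == 0, X_t, A @ beta)

# Step 4: adjust on the imputed covariate instead of the observed one
R = np.column_stack([np.ones(n), D, X_hat, W, Z])
att = np.linalg.lstsq(R, Y, rcond=None)[0][1]
print(round(att, 2))  # close to the total ATT of 2.5
```

Unlike conditioning on the observed \(X_t\), conditioning on \(\hat{X}_t(0)\) does not absorb the indirect channel, because the imputed value is a function of pre-treatment variables only.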

This requires Covariate Unconfoundedness:

\[ X_t(0) \perp\!\!\!\perp D \mid X_{t-1}, W, Z \]

Conditional on pre-treatment variables, the counterfactual covariate distribution is the same for treated and control units. Conditioning on \(W\) is critical because \(W\) confounds both \(D\) and \(X_t(0)\) — without it, the imputation model would inherit the confounding bias.

Imputation: Imputes \(X_t(0)\) for treated units using the control group. Recovers the total ATT under Covariate Unconfoundedness.

Doubly Robust ML: Combines imputation with propensity score weighting. Consistent if either the outcome model or the propensity score is correctly specified. Uses random forests with cross-fitting.
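The paper's estimator has more structure than this, but the generic AIPW-style recipe it builds on can be sketched as follows, reusing the illustrative DGP from above (true total ATT 2.5). Random forests are fit under 2-fold cross-fitting, and the outcome-model and propensity-score corrections are combined so that the estimate stays consistent if either nuisance model is accurate.

```python
# Sketch of a cross-fitted AIPW estimator for the ATT, using pre-treatment
# covariates only. Illustrative, not the paper's exact construction.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
n = 5000
W, Z, X_pre = rng.normal(size=(3, n))
D = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * W + 0.5 * X_pre + 0.3 * Z))))
X_t = 0.8 * X_pre + 0.5 * W + 0.4 * Z + 1.0 * D + rng.normal(size=n)
Y = 2.0 * D + 0.5 * X_t + W + Z + rng.normal(size=n)   # total ATT = 2.5

X = np.column_stack([X_pre, W, Z])   # pre-treatment covariates only
fold = rng.permutation(n) % 2        # 2-fold cross-fitting
mu0 = np.empty(n)                    # out-of-fold estimate of E[Y | D=0, X]
e = np.empty(n)                      # out-of-fold estimate of P(D=1 | X)
for k in (0, 1):
    tr, te = fold != k, fold == k
    ctrl = tr & (D == 0)
    mu0[te] = RandomForestRegressor(random_state=0).fit(X[ctrl], Y[ctrl]).predict(X[te])
    e[te] = RandomForestClassifier(random_state=0).fit(X[tr], D[tr]).predict_proba(X[te])[:, 1]
e = np.clip(e, 0.02, 0.98)           # guard against extreme propensity weights

# AIPW moment for the ATT: outcome-model residuals on the treated,
# minus propensity-reweighted residuals on the controls
resid = Y - mu0
att = (np.sum(D * resid) - np.sum((1 - D) * (e / (1 - e)) * resid)) / D.sum()
print(round(att, 2))
```

The reweighting term corrects whatever systematic error the random-forest outcome model leaves in the control residuals, which is the source of the double robustness.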

See Estimation for details on both methods, and Worked Example for a full R walkthrough.