The Bad Controls Problem

The setup

Consider a standard staggered DiD setting. You observe panel data with:

- \(Y_{it}\): outcome
- \(D_{it}\): treatment indicator
- \(X_{it}\): a time-varying covariate
- \(Z_i\): time-invariant covariates

You want to estimate the ATT, and you believe conditional parallel trends holds:

\[ \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X_t, D = 1] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X_t, D = 0] \]

The problem: \(X_t\) is measured after treatment, so it may itself be shifted by treatment.

The DAG

The causal structure for the post-treatment covariate \(X_t\) looks like this (cf. Figure 1 in the paper):

```mermaid
flowchart TD
    W["W (auxiliary)"] --> D
    W --> Xt["X_t"]
    Xpre["X_{t-1}"] --> D
    Xpre --> Xt
    Z --> D
    Z --> Xt
    D -->|bad!| Xt
```

\(W\) is a pre-treatment variable that confounds both treatment \(D\) and the post-treatment covariate \(X_t\). The pre-treatment covariate \(X_{t-1}\) and time-invariant \(Z\) also affect both \(D\) and \(X_t\).
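To make the DAG concrete, here is a minimal simulated data-generating process that matches its arrows. Every functional form and coefficient below is a made-up illustration, not something taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Pre-treatment variables: auxiliary W, lagged covariate X_{t-1}, time-invariant Z.
W, X_pre, Z = rng.normal(size=(3, n))

# W, X_{t-1}, and Z all feed into treatment assignment (the arrows into D)...
D = (0.8 * W + 0.5 * X_pre + 0.5 * Z + rng.normal(size=n) > 0).astype(float)

# ...and into the post-treatment covariate X_t, which treatment also
# shifts (the "bad!" edge; here the shift is +2 for treated units).
X_t = X_pre + W + Z + 2.0 * D + rng.normal(size=n)

# The treated-control gap in X_t mixes the treatment shift with
# selection on W, X_{t-1}, and Z, so it exceeds the shift of 2.
print(X_t[D == 1].mean() - X_t[D == 0].mean())
```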
The full causal structure including the outcome is:
```mermaid
flowchart LR
    D -->|direct| Y
    D -->|bad!| Xt["X_t"]
    Xt --> Y
```
Treatment \(D\) affects both \(Y\) (directly) and \(X_t\) (the bad control). Because \(X_t\) in turn affects \(Y\), there are two causal channels:
- Direct effect: \(D \to Y\)
- Indirect effect: \(D \to X_t \to Y\) (goes through the bad control)
The total ATT includes both channels. Conditioning on post-treatment \(X_t\) blocks the indirect channel and introduces selection bias.
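To fix ideas, suppose (purely as an illustration; the paper does not assume linearity) that the effects are linear: a direct effect \(\theta\) of \(D\) on \(Y\), an effect \(\delta\) of \(X_t\) on \(Y\), and a treatment-induced shift \(\gamma\) in \(X_t\). Then

\[ \mathrm{ATT} \;=\; \underbrace{\theta}_{D \to Y} \;+\; \underbrace{\delta \gamma}_{D \to X_t \to Y} \]

and conditioning on \(X_t\) removes the \(\delta\gamma\) term (and, in general, adds a selection-bias term on top of that).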
What goes wrong?
Approach 1: Include \(X_t\) directly
When you condition on post-treatment \(X_t\), you are effectively comparing treated and control units with the same observed \(X_t\). But treatment shifts \(X_t\), so you’re comparing:
- Treated units whose \(X_t\) was pulled down by bad luck (despite treatment pushing it up)
- Control units whose \(X_t\) is naturally at that level
This is classic post-treatment selection bias. You recover the direct effect minus a bias term, not the total ATT.
Including \(X_t\): Absorbs the indirect effect \(D \to X \to Y\). Estimates the direct effect (at best), not the total ATT. Generally biased.
Approach 2: Drop \(X_t\) entirely
If parallel trends only holds conditional on \(X_t\), then dropping it means the unconditional parallel trends assumption fails. Your DiD estimate is biased in the other direction.
Dropping \(X_t\): Violates parallel trends if trends differ across \(X\) levels. Omitted variable bias.
Approach 3: Use only \(X_{t-1}\)
Using the pre-treatment value \(X_{t-1}\) avoids the post-treatment selection problem. This works if:
\[ \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X_{t-1}, D = 1] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X_{t-1}, D = 0] \]
But this is a stronger assumption than the original conditional parallel trends. It requires that lagged \(X\) is sufficient — the current value \(X_t(0)\) adds no information. This may or may not hold.
Using \(X_{t-1}\) only: Works under restrictive conditions (parallel trends conditional on lagged \(X\)). May be biased when \(X_t(0)\) matters beyond \(X_{t-1}\).
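The three failure modes can be seen side by side in a toy linear simulation. All names and coefficients below are illustrative assumptions: direct effect 1, effect of \(X_t\) on the outcome change 1, treatment shift of \(X_t\) equal to 2, so the total ATT is 3:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
W, X_pre, Z = rng.normal(size=(3, n))
D = (0.8 * W + 0.5 * X_pre + 0.5 * Z + rng.normal(size=n) > 0).astype(float)
X_t = X_pre + W + Z + 2.0 * D + rng.normal(size=n)   # treatment shifts X_t by 2
dY = 1.0 * D + 1.0 * X_t + rng.normal(size=n)        # total ATT = 1 + 1*2 = 3

def coef_on_D(y, *controls):
    """OLS of y on an intercept, D, and the given controls; return D's coefficient."""
    X = np.column_stack([np.ones(n), D, *controls])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

tau_include = coef_on_D(dY, X_t)    # Approach 1: ~1, the direct effect only
tau_drop    = coef_on_D(dY)         # Approach 2: inflated, confounding bites
tau_lagged  = coef_on_D(dY, X_pre)  # Approach 3: biased, W and Z still confound
print(tau_include, tau_drop, tau_lagged)
```

None of the three estimates lands on the total ATT of 3: including \(X_t\) undershoots (direct effect only), while dropping it or using only \(X_{t-1}\) overshoots because the confounders are left uncontrolled.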
The solution: impute \(X_t(0)\)
The key insight: we can’t use the observed \(X_t\) for treated units (it’s contaminated by treatment). But we can predict what \(X_t\) would have been without treatment — the counterfactual \(X_t(0)\).
The idea:
- Among control units, \(X_t\) is not affected by treatment, so \(X_t = X_t(0)\)
- Learn the relationship \(X_t \sim f(X_{t-1}, W, Z)\) from the control group
- For treated units, predict \(\hat{X}_t(0) = f(X_{t-1}, W, Z)\)
- Run DiD using the imputed \(\hat{X}_t(0)\) instead of the observed \(X_t\)
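The four steps above can be sketched in a few lines. This is a minimal Python illustration with a linear imputation model, a made-up linear DGP (total ATT of 3), and a regression-adjusted DiD standing in for the final step; the paper's implementation uses more flexible learners:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
W, X_pre, Z = rng.normal(size=(3, n))
D = (0.8 * W + 0.5 * X_pre + 0.5 * Z + rng.normal(size=n) > 0).astype(float)
X_t = X_pre + W + Z + 2.0 * D + rng.normal(size=n)   # treatment shifts X_t by 2
dY = 1.0 * D + 1.0 * X_t + rng.normal(size=n)        # total ATT = 1 + 1*2 = 3

# Steps 1-2: among controls X_t = X_t(0), so learn f(X_{t-1}, W, Z) there.
V = np.column_stack([np.ones(n), X_pre, W, Z])
ctrl = D == 0
beta = np.linalg.lstsq(V[ctrl], X_t[ctrl], rcond=None)[0]

# Step 3: impute the counterfactual covariate for every unit.
X_t0_hat = V @ beta

# Step 4: adjust for the imputed X_t(0) instead of the observed X_t.
design = np.column_stack([np.ones(n), D, X_t0_hat])
tau_hat = np.linalg.lstsq(design, dY, rcond=None)[0][1]
print(tau_hat)   # close to 3: the total ATT, direct plus indirect channel
```

Because \(\hat{X}_t(0)\) is built only from pre-treatment variables, it cannot absorb the \(D \to X_t \to Y\) channel, so the coefficient on \(D\) recovers the total effect rather than just the direct one.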
This requires Covariate Unconfoundedness:
\[ X_t(0) \perp\!\!\!\perp D \mid X_{t-1}, W, Z \]
Conditional on pre-treatment variables, the counterfactual covariate distribution is the same for treated and control units. Conditioning on \(W\) is critical because \(W\) confounds both \(D\) and \(X_t(0)\) — without it, the imputation model would inherit the confounding bias.
Imputation: Imputes \(X_t(0)\) for treated units using the control group. Recovers the total ATT under Covariate Unconfoundedness.
Doubly Robust ML: Combines imputation with propensity score weighting. Consistent if either the outcome model or the propensity score is correctly specified. Uses random forests with cross-fitting.
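One generic doubly robust (AIPW-type) form for the ATT, written here as a sketch (the paper's estimator, with random forests and cross-fitting, may differ in its details), is:

\[ \hat{\tau}^{\text{dr}} \;=\; \frac{1}{n_1} \sum_{i:\, D_i = 1} \bigl(\Delta Y_i - \hat{\mu}_0(V_i)\bigr) \;-\; \frac{1}{n_1} \sum_{i:\, D_i = 0} \frac{\hat{e}(V_i)}{1 - \hat{e}(V_i)} \bigl(\Delta Y_i - \hat{\mu}_0(V_i)\bigr) \]

where \(V_i = (X_{t-1,i}, W_i, Z_i)\), \(\hat{\mu}_0\) is a control-group regression of the outcome change on \(V\), \(\hat{e}(V) = \Pr(D = 1 \mid V)\) is the propensity score, and \(n_1\) is the number of treated units. If \(\hat{\mu}_0\) is correct, the control-group term has mean zero whatever \(\hat{e}\) is; if \(\hat{e}\) is correct, the weights rebalance the controls whatever \(\hat{\mu}_0\) is. That is the source of the double robustness.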
See Estimation for details on both methods, and Worked Example for a full R walkthrough.