Agents

Twenty specialists that never edit each other’s work

Four Architectural Rules

The system uses 20 specialized agents, most of them organized into worker-critic pairs. Four rules explain most of the system’s behavior.

Worker-Critic Pairing

Every creator has a paired destroyer. The critic reads the artifact, scores it against a rubric, and lists what must change, but never edits the file itself. If the score falls below 80, the worker revises and the critic re-evaluates. This cycle repeats for up to 3 rounds.
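The revise-and-rescore cycle can be sketched as a small loop. This is a minimal illustration, not the system's actual orchestrator code: `run_pair`, `toy_worker`, and `toy_critic` are hypothetical names, and the sketch assumes one round means one worker pass followed by one critic evaluation. Only the threshold (80) and the round limit (3) come from the text above.

```python
from typing import Callable

PASS_THRESHOLD = 80  # from the text: a score below 80 triggers a revision
MAX_ROUNDS = 3       # from the text: the cycle runs for at most 3 rounds

Findings = list[str]


def run_pair(
    work: Callable[[Findings], str],
    critique: Callable[[str], tuple[int, Findings]],
) -> tuple[str, int, int, bool]:
    """Alternate a worker and its critic until the score passes or rounds run out.

    Returns (artifact, last_score, rounds_used, converged).
    Both callables are hypothetical stand-ins for agent invocations.
    """
    findings: Findings = []
    artifact, score = "", 0
    for round_no in range(1, MAX_ROUNDS + 1):
        artifact = work(findings)             # only the worker edits
        score, findings = critique(artifact)  # the critic scores and lists changes
        if score >= PASS_THRESHOLD:
            return artifact, score, round_no, True
    return artifact, score, MAX_ROUNDS, False


# Toy stand-ins: every revision adds one section; the critic wants three.
drafts: list[str] = []


def toy_worker(findings: Findings) -> str:
    drafts.append("section;")
    return "".join(drafts)


def toy_critic(artifact: str) -> tuple[int, Findings]:
    n = artifact.count("section;")
    return min(100, 30 * n), ([] if n >= 3 else ["add another section"])


result = run_pair(toy_worker, toy_critic)
print(result)
```

With these toy stand-ins the pair scores 30, then 60, then 90, so it converges on the third round. A pair that still scored below 80 after round 3 would return `converged=False`, which is the signal for escalation.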

| Worker | Critic | What’s Reviewed |
|---|---|---|
| librarian | librarian-critic | Literature coverage, gaps, recency |
| explorer | explorer-critic | Data feasibility, quality, identification fit |
| strategist | strategist-critic | Identification validity, assumptions, robustness |
| theorist | theorist-critic | Proof validity, assumption minimality, notation |
| data-engineer | coder-critic | Data pipeline quality, reproducibility |
| coder | coder-critic | Code quality, strategy alignment, numerical discipline |
| writer | writer-critic | Manuscript polish, claims-evidence, LaTeX format |
| storyteller | storyteller-critic | Talk structure, visual quality, content fidelity |

Five agents have no paired critic because they are infrastructure (orchestrator, verifier) or already serve as reviewers (editor, domain-referee, methods-referee).

Separation of Powers

Critics never create. Creators never self-score. If a critic invocation produces a file in scripts/, paper/, or paper/talks/, the orchestrator flags it as a violation. If a creator reports its own score, the orchestrator discards it and dispatches the critic.

This prevents the system from rubber-stamping its own output. A critic who fixes their own findings has incentive to find only fixable issues. Separation keeps criticism honest.
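Both checks described above are mechanical, so they can be sketched in a few lines. This is an illustrative sketch under assumptions: the function names and the example file names (`review.md`, `03_main.R`, `seminar.tex`) are hypothetical, and it assumes critics can be recognised by the "-critic" suffix used on this page. Only the three protected directories come from the text.

```python
from pathlib import PurePosixPath
from typing import Optional

# Directories the text reserves for creators.
CREATOR_DIRS = ("scripts", "paper", "paper/talks")


def critic_violations(files_written: list[str]) -> list[str]:
    """Return files a critic invocation wrote inside creator-only directories."""
    return [
        name
        for name in files_written
        if any(PurePosixPath(name).is_relative_to(d) for d in CREATOR_DIRS)
    ]


def accepted_score(reporter: str, score: int) -> Optional[int]:
    """Keep a score only when it comes from a critic; discard self-scores.

    Assumption: critic agents are the ones whose name ends in "-critic".
    """
    return score if reporter.endswith("-critic") else None


# Hypothetical list of files produced during one critic invocation.
flagged = critic_violations(
    [
        "quality_reports/strategy/review.md",
        "scripts/R/03_main.R",
        "paper/talks/seminar.tex",
    ]
)
print(flagged)
```

Here the report under quality_reports/ is allowed, while the two files under scripts/ and paper/talks/ are flagged. `PurePosixPath.is_relative_to` requires Python 3.9 or later.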

Cold-Read Protocol

Critics evaluate blind. On every round, the critic receives only the artifact, its scoring rubric, the severity level, and the content invariants. It does not receive: what round this is, what the worker struggled with, prior critic reports, or the research journal. A critic that knows the worker has already tried twice will soften its evaluation. The cold-read protocol prevents this drift.
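One way to enforce a cold read is to build the critic's input from an allow-list rather than by deleting fields from a larger context. The sketch below is hypothetical: `ColdReadPacket`, `build_packet`, the context keys, and the example values are illustrative assumptions. The four permitted items are the ones listed in the text.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ColdReadPacket:
    """Everything a critic is allowed to see, per the text, and nothing else."""
    artifact: str
    rubric: str
    severity: str
    content_invariants: tuple[str, ...]


def build_packet(context: dict) -> ColdReadPacket:
    """Copy only the four permitted items out of a larger orchestrator context."""
    return ColdReadPacket(
        artifact=context["artifact"],
        rubric=context["rubric"],
        severity=context["severity"],
        content_invariants=tuple(context["content_invariants"]),
    )


# Hypothetical orchestrator context; the last three keys must not reach the critic.
context = {
    "artifact": "strategy memo text",
    "rubric": "rubric text",
    "severity": "high",
    "content_invariants": ["invariant text"],
    "round": 2,
    "prior_reports": ["round 1 report"],
    "research_journal": "journal text",
}
packet = build_packet(context)
visible = [f.name for f in fields(packet)]
print(visible)
```

Because the packet type has no field for the round number, prior reports, or the journal, a new context key cannot leak to the critic by accident; it would have to be added to the dataclass deliberately.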

Three-Round Limit and Escalation

Three strikes, then escalation. If a worker-critic pair cannot converge after 3 rounds, the orchestrator routes the problem elsewhere — either to a different agent or to the user. Escalation targets are declared per agent in permissions.md. When the coder and coder-critic deadlock, the issue escalates to the strategist-critic (is the memo implementable?). When the strategist and strategist-critic deadlock, it escalates to the user (fundamental design question). This prevents infinite loops while ensuring hard problems get human judgment.
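The escalation rule amounts to a strike counter plus a lookup table. The sketch below is an assumption-laden illustration: the function name and the return strings are hypothetical, and the table holds only the two targets named in the text (the real declarations live in permissions.md, whose format is not shown here).

```python
MAX_ROUNDS = 3  # from the text: three strikes, then escalation

# The two escalation targets named in the text; all others are declared in permissions.md.
ESCALATION_TARGETS = {
    "coder": "strategist-critic",
    "strategist": "user",
}


def after_failed_round(worker: str, strikes: int) -> str:
    """Decide what happens after a critic score below the threshold."""
    if strikes < MAX_ROUNDS:
        return "revise"
    # No fallback on purpose: a worker without a declared target raises KeyError
    # instead of being routed somewhere by guesswork.
    return "escalate:" + ESCALATION_TARGETS[worker]


print(after_failed_round("coder", 2))
print(after_failed_round("coder", 3))
print(after_failed_round("strategist", 3))
```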


The Skill-Centric Model

Agents used to be monolithic. Each agent file contained identity, checklists, templates, rubrics, and scoring tables — 200 to 474 lines of coupled operational knowledge. Updating a checklist meant editing an agent. Adding a paper type meant editing every agent that touches paper types.

The problem: operational knowledge evolved faster than identity. A new estimator package required updating checklists, not changing who the coder is. Coupling them meant agent files churned constantly.

Now agents are lean identity files. Each one is 50–140 lines declaring: role, voice, constraints, tool access, paper-type awareness, and pointers to where the operational knowledge lives. The coder file says “read analyze/templates/pre-code-report.md before writing code” — it does not contain the pre-code report template itself.

The operational knowledge — checklists, rubrics, scoring tables, templates, gotchas — lives in skill folders. When the orchestrator dispatches an agent, it tells the agent which skill invoked it and the agent reads the relevant templates from that skill’s directory. This means:

  • Agent files change rarely (identity is stable)
  • Skill folders evolve freely (knowledge improves every project)
  • The same agent can serve multiple skills with different templates
  • Adding a new paper type means adding templates, not rewriting agents

A concrete example: the coder agent file (136 lines) declares its role, tool access, workflow stages, engineering standards, and output expectations. But when /analyze dispatches it, the coder reads analyze/templates/pre-code-report.md for the pre-code report format, analyze/templates/r-script-structure.R for the scaffold, and analyze/gotchas.md for known pitfalls. None of that lives in the agent file. If we learn a new gotcha tomorrow, we add one line to analyze/gotchas.md — the agent file never changes.

The strategist-critic works the same way. Its identity file (62 lines) says: “You are a top-5 journal referee. Run 4 sequential phases. Read the causal audit template.” The 4-phase audit protocol, the design-specific checklists, and the scoring rubric all live in review/templates/ and review/config/. Updating how we evaluate RDD papers means editing a template, not touching the critic’s identity.
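The coder example can be sketched as a lookup from (skill, agent) to the files the agent should read. This is a hypothetical illustration of the idea, not the orchestrator's real dispatch format: `SKILL_READS`, `dispatch_prompt`, and the wording of the prompt are assumptions. The three file paths are the ones quoted in the text.

```python
# (skill, agent) -> files the agent reads; paths are the ones quoted in the text.
SKILL_READS = {
    ("analyze", "coder"): [
        "analyze/templates/pre-code-report.md",
        "analyze/templates/r-script-structure.R",
        "analyze/gotchas.md",
    ],
}


def dispatch_prompt(skill: str, agent: str) -> str:
    """Build the instruction telling an agent which skill files to read."""
    reads = SKILL_READS[(skill, agent)]
    lines = [f"You were invoked by /{skill} as {agent}. Read before acting:"]
    lines += [f"- {path}" for path in reads]
    return "\n".join(lines)


prompt = dispatch_prompt("analyze", "coder")
print(prompt)
```

The design point is that a second skill could reuse the same agent by adding another key to the table, without any change to the agent's identity file.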

For full agent specifications, see the .claude/agents/ directory — each file is self-contained and authoritative.


Agent Groups

Rather than profiling each agent individually (those profiles live in .claude/agents/), we organize by function. Each group activates at a specific pipeline phase, produces specific artifacts, and has specific quality gates.

Discovery Pair

The librarian searches top-5 general-interest journals, field journals, NBER working papers, and citation chains to produce an annotated bibliography, a BibTeX file, a frontier map (what has been done), and a positioning statement (where the new paper fits). The explorer searches for datasets — public, administrative, survey — and grades each by feasibility: variable coverage, access requirements, panel structure, and known data issues.

Both run in parallel from the start of a project. Neither depends on the other. The librarian-critic checks for coverage gaps (missing seminal papers, missing methods literature, over-reliance on working papers). The explorer-critic checks measurement validity (does the proposed variable actually capture the concept?) and identification compatibility (does this data support the likely design?).

Together they answer a single question: what has been done, and what data could let us do something new?

Activates: From a research idea or spec. No prerequisites.
Produces: quality_reports/literature/ and quality_reports/data-assessment/
Quality gate: Both critics must score >= 80 before Strategy activates.

Strategy Pair

The strategist is paper-type aware. Given the literature and data assessment, it classifies the paper type (reduced-form, structural, theory+empirics, descriptive) and applies the corresponding strategy template. For reduced-form papers, it assesses the identification landscape, proposes designs ranked by credibility, and specifies design-specific estimation guidance (modern DiD estimators for staggered treatment, effective F for IV, MSE-optimal bandwidth for RDD). For structural papers, it specifies the model environment, decision problem, equilibrium concept, identification of each parameter, and counterfactual design. It always anticipates the top 5 referee objections.

The theorist activates conditionally — only for papers that need formal proofs. It drafts assumptions, definitions, propositions, and proofs at journal-publication rigor, with every theorem linked to the empirical strategy. Its critic runs a 4-phase proof review with early stopping on logical gaps.

The strategist-critic is the system’s most important gatekeeper. It runs a 4-phase sequential audit: identify the design, check whether the core assumptions hold (design-specific checklists), evaluate inference and code-theory alignment, then check robustness. A critical finding in Phase 2 stops the review: there is no point checking robustness if parallel trends are violated.
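The sequential audit with early stopping can be sketched as follows. The phase functions, the `(severity, message)` finding format, and the severity label "critical" are hypothetical; the sketch also assumes that only Phase 2 halts the review, since that is the only stopping point the text names.

```python
from typing import Callable

Finding = tuple[str, str]  # (severity, message)
Phase = Callable[[str], list[Finding]]

# Assumption: only Phase 2 (core assumptions) halts the review on a critical finding.
STOP_ON_CRITICAL = {2}


def run_audit(memo: str, phases: list[Phase]) -> tuple[list[int], list[Finding]]:
    """Run phases in order; stop early on a critical finding in a stopping phase.

    Returns (phase numbers that ran, all findings collected).
    """
    ran: list[int] = []
    findings: list[Finding] = []
    for number, phase in enumerate(phases, start=1):
        ran.append(number)
        found = phase(memo)
        findings.extend(found)
        is_critical = any(severity == "critical" for severity, _ in found)
        if is_critical and number in STOP_ON_CRITICAL:
            break
    return ran, findings


# Hypothetical phase checks for a memo whose core assumption fails.
def identify_design(memo: str) -> list[Finding]:
    return []


def check_assumptions(memo: str) -> list[Finding]:
    return [("critical", "parallel trends violated")]


def check_inference(memo: str) -> list[Finding]:
    return []


def check_robustness(memo: str) -> list[Finding]:
    return []


ran, findings = run_audit(
    "memo text",
    [identify_design, check_assumptions, check_inference, check_robustness],
)
print(ran, findings)
```

In this example only phases 1 and 2 run; the inference and robustness checks are skipped because the assumption check already returned a critical finding.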

Activates: After at least one Discovery output (literature OR data assessment).
Produces: quality_reports/strategy/ and (when activated) quality_reports/theory/
Quality gate: Strategist-critic score >= 80 before Execution activates.

Execution Trio

The data-engineer takes raw data and produces clean, analysis-ready datasets. It documents every sample restriction with counts, constructs variables per strategy memo definitions, merges datasets with documented merge rates, and produces publication-quality figures (custom themes, colorblind-safe palettes, no titles inside plots).

The coder translates the strategy memo into numbered analysis scripts (00_master.R through 07_tables.R) with engineering discipline: numerical guards, pre-allocation, a single seed, here() for all paths, and a paper-to-code naming map linking every \(\beta\) and \(Y_{it}\) to its code variable. It implements the main specification, all robustness checks from the memo, and exports bare tabular environments for tables.

The writer drafts the manuscript after code is approved. It refuses to write Results without actual output files in paper/tables/ — this gate exists because the distinction between “the strategy predicts X” and “the data shows X” is the difference between a proposal and a paper. It maintains a claim-source map tracing every numerical claim to a specific script line and output file.
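The writer's gate is a file-existence check, which the sketch below illustrates on a throwaway directory. The function name and the example file name `main.tex` are hypothetical, and the sketch assumes "actual output files" means at least one regular file directly inside paper/tables/.

```python
import tempfile
from pathlib import Path


def results_gate_open(project_root: Path) -> bool:
    """True only if paper/tables/ exists and holds at least one output file."""
    tables = project_root / "paper" / "tables"
    return tables.is_dir() and any(p.is_file() for p in tables.iterdir())


# Demonstrate on a throwaway project directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = results_gate_open(root)     # no paper/tables/ yet
    (root / "paper" / "tables").mkdir(parents=True)
    empty_dir = results_gate_open(root)  # directory exists but is empty
    (root / "paper" / "tables" / "main.tex").write_text("% table body\n")
    after = results_gate_open(root)      # one output file is present

print(before, empty_dir, after)
```

An empty directory keeps the gate closed, so creating the folder structure ahead of time does not let the writer draft Results early.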

The coder-critic reviews both the data-engineer and the coder across 16 categories spanning strategic alignment (does the code match the memo?), code quality (reproducibility, numerics, function design), and infrastructure (checkpointing, error handling, prohibited patterns). The writer-critic runs 8 checks: structure, claims-evidence, identification fidelity, AI-pattern detection, LaTeX format, compilation, voice fidelity, and notation consistency.

Activates: Data-engineer and coder after strategist-critic score >= 80. Writer after coder-critic score >= 80.
Produces: data/cleaned/, scripts/R/, paper/tables/, paper/figures/, paper/sections/
Quality gate: Coder-critic and writer-critic both >= 80 before Peer Review.

Review Team

The editor performs desk review: reads the abstract and introduction, skims the contribution, verifies novelty, and decides whether to send to referees or desk reject. When it sends, it assigns each referee a contrasting intellectual disposition — STRUCTURAL (“where’s the mechanism?”), CREDIBILITY (“show me the pre-trends”), MEASUREMENT (“how is this measured?”), POLICY (“does this generalize?”), THEORY (“what does the model predict?”), or SKEPTIC (“what would make this go away?”). Each referee also receives randomized pet peeves drawn from pools of critical and constructive tendencies.

The domain-referee evaluates substance: contribution and novelty (30%), literature positioning (25%), substantive arguments (20%), external validity (15%), and journal fit (10%). The methods-referee evaluates methods with paper-type-specific weights: identification strategy, estimation, inference, robustness, and replication readiness. Neither sees the other’s report.
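The domain-referee weights sum to 100%, so an overall score can be computed as a weighted average. The sketch below assumes each component is scored on a 0 to 100 scale and combined linearly; the text gives the weights but not the formula. The dictionary keys, the function name, and the example scores are hypothetical.

```python
# Domain-referee weights in percent, as listed in the text.
DOMAIN_WEIGHTS = {
    "contribution_novelty": 30,
    "literature_positioning": 25,
    "substantive_arguments": 20,
    "external_validity": 15,
    "journal_fit": 10,
}
assert sum(DOMAIN_WEIGHTS.values()) == 100


def weighted_score(component_scores: dict[str, float], weights: dict[str, int]) -> float:
    """Weighted average of 0-100 component scores; weights are percentages."""
    if component_scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same components")
    return sum(weights[k] * component_scores[k] for k in weights) / 100


# Hypothetical component scores for one report.
example = {
    "contribution_novelty": 80,
    "literature_positioning": 70,
    "substantive_arguments": 90,
    "external_validity": 60,
    "journal_fit": 100,
}
overall = weighted_score(example, DOMAIN_WEIGHTS)
print(overall)  # 78.5
```

The same function would work for the methods-referee once its paper-type-specific weights are supplied; those weights are not given on this page.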

After both reports arrive, the editor classifies each concern as FATAL (cannot be fixed), ADDRESSABLE (fixable with revision), or TASTE (referee preference), resolves disagreements with explicit reasoning, and issues a decision.

Activates: After writer-critic and coder-critic both >= 80. Invoked via /review --peer.
Produces: quality_reports/reviews/
Quality gate: Editorial decision of Accept or Minor Revisions to proceed to Submission.

Presentation Pair

The storyteller creates talks from the paper in four formats (job market 40–50 slides, seminar 25–35, short 10–15, lightning 3–5) and two output types (Beamer PDF, Quarto RevealJS HTML). It is paper-type aware: reduced-form talks lead with the policy question and show variation; structural talks build the model step by step with progressive reveal and pay off with the counterfactual; theory+empirics talks present competing explanations and a distinguishing prediction; descriptive talks lead with what is missing in current measures.

The storyteller-critic checks narrative flow (does the hook work?), visual quality (text overflow, font sizes, one idea per slide), content fidelity (numbers match the paper), and compilation. Presentation scores are advisory — they do not block commits or PRs.

Activates: After writer-critic score >= 80. Can run in parallel with Peer Review.
Produces: paper/talks/ or paper/quarto/
Quality gate: Advisory only.

Infrastructure

The orchestrator is the project manager. It reads permissions.md to determine which agents can activate, validates pre-dispatch (do REQUIRES artifacts exist?) and post-completion (do PRODUCES artifacts exist with required sections?), launches worker-critic pairs in parallel when independent, tracks strike counts, computes weighted overall scores, and maintains structured pipeline state in pipeline_state.json. It makes no research decisions — when judgment is needed, it escalates to the user with a clear question.

The verifier checks mechanical correctness in two modes. Standard mode (between phases): LaTeX compilation, script execution, file integrity, output freshness. Submission mode (before journal deposit): package inventory, dependency verification, data provenance, end-to-end execution, output cross-reference, and AEA-format README completeness. Its score is binary (0 or 100) and carries 5% weight in the overall project score.
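The verifier's binary scoring can be sketched as an all-or-nothing check. The sketch assumes that a single failing check yields 0, which is one natural reading of "binary" but is not spelled out in the text. The check identifiers and function name are hypothetical; the four standard-mode checks and the 5% weight come from the text.

```python
VERIFIER_WEIGHT_PCT = 5  # from the text: 5% of the overall project score

# Standard-mode checks named in the text.
STANDARD_CHECKS = (
    "latex_compilation",
    "script_execution",
    "file_integrity",
    "output_freshness",
)


def verifier_score(results: dict[str, bool]) -> int:
    """Binary score: 100 only if every standard check passed, otherwise 0."""
    return 100 if all(results.get(check, False) for check in STANDARD_CHECKS) else 0


all_pass = dict.fromkeys(STANDARD_CHECKS, True)
one_fail = {**all_pass, "output_freshness": False}

# Points the verifier contributes to a 100-point overall score.
points_when_passing = VERIFIER_WEIGHT_PCT * verifier_score(all_pass) / 100
points_when_failing = VERIFIER_WEIGHT_PCT * verifier_score(one_fail) / 100
print(points_when_passing, points_when_failing)
```

A missing check result is treated as a failure here, so an incomplete verification run cannot pass by omission.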

Activates: Orchestrator runs throughout. Verifier runs between phases and at submission.
Produces: quality_reports/pipeline_state.json, quality_reports/verification_report.md
Quality gate: Verifier must PASS (100) for submission.


Phase Flow

Research is not a waterfall, but it has a direction. Agents activate by dependency, not by sequence number. If you already have data and a draft, you can enter at Peer Review. If a referee says “control for X,” the orchestrator routes back to the coder alone — not through the entire pipeline.

The dependency chain:

Discovery ──→ Strategy ──→ Execution ──→ Peer Review ──→ Submission
(librarian,    (strategist,   (data-engineer,  (editor,         (verifier)
 explorer)      theorist)      coder, writer)   referees)

                                                    ↕
                                              Presentation
                                              (storyteller)

Each arrow is gated by a critic score. The orchestrator checks permissions.md before every dispatch: are the REQUIRES artifacts present? Has the prerequisite critic scored >= 80? If not, the agent does not activate.
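The pre-dispatch check combines two conditions: required artifacts exist, and the prerequisite critic has scored at least 80. The sketch below uses a hypothetical in-memory `Permission` record, since the actual format of permissions.md is not shown on this page. The writer example is assembled from statements elsewhere in the text (the writer needs output in paper/tables/ and a coder-critic score >= 80).

```python
from dataclasses import dataclass, field

GATE = 80  # from the text: prerequisite critics must score >= 80


@dataclass
class Permission:
    """Hypothetical in-memory form of one permissions.md entry."""
    requires: list[str] = field(default_factory=list)      # artifacts that must exist
    gate_critics: list[str] = field(default_factory=list)  # critics that must pass the gate


def blockers(entry: Permission, artifacts: set[str], scores: dict[str, int]) -> list[str]:
    """Return the reasons an agent cannot activate; an empty list means it can."""
    reasons = [f"missing artifact: {a}" for a in entry.requires if a not in artifacts]
    for critic in entry.gate_critics:
        score = scores.get(critic)
        if score is None:
            reasons.append(f"no score from {critic}")
        elif score < GATE:
            reasons.append(f"{critic} scored {score} (< {GATE})")
    return reasons


# Writer entry assembled from the text: table output plus an approved code review.
writer = Permission(requires=["paper/tables/"], gate_critics=["coder-critic"])

blocked = blockers(writer, artifacts={"paper/tables/"}, scores={"coder-critic": 74})
ready = blockers(writer, artifacts={"paper/tables/"}, scores={"coder-critic": 85})
print(blocked, ready)
```

Returning the list of reasons rather than a bare boolean lets the orchestrator report exactly why an agent did not activate.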

Re-entry is allowed at any phase except Submission (terminal). A Major Revisions decision from Peer Review triggers re-entry at whatever phase the referee comments require — sometimes just the writer (clarification), sometimes back to the coder (new analysis), sometimes back to the strategist (fundamental design challenge).
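Re-entry routing can be pictured as a mapping from the kind of referee comment to the agent that must act. The three pairs below come from the text; the comment labels, the function name, and the rule that the earliest affected agent wins when a report mixes several kinds of comment are assumptions made for illustration.

```python
# Kinds of referee comment and the re-entry points given in the text.
REENTRY = {
    "clarification": "writer",
    "new_analysis": "coder",
    "design_challenge": "strategist",
}

# Pipeline order, earliest first. Assumption: with mixed comments, re-enter at the
# earliest agent any of them requires, since later work depends on it.
PIPELINE_ORDER = ["strategist", "coder", "writer"]


def reentry_point(comment_kinds: list[str]) -> str:
    """Pick the earliest agent that any comment in the report requires."""
    targets = {REENTRY[kind] for kind in comment_kinds}
    return min(targets, key=PIPELINE_ORDER.index)


print(reentry_point(["clarification"]))
print(reentry_point(["clarification", "new_analysis"]))
print(reentry_point(["new_analysis", "design_challenge"]))
```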

The system distinguishes between phases (sequential gates) and parallel groups (concurrent work within a phase). The librarian and explorer run concurrently. The data-engineer and coder run concurrently. But the writer waits for the coder to finish. These relationships are all declared in one file — permissions.md — not hardcoded in agent logic.


Quick Reference

All 20 agents at a glance.

| Agent | Phase | Creates / Checks | Paired With |
|---|---|---|---|
| librarian | Discovery | Annotated bibliography, BibTeX, frontier map | librarian-critic |
| librarian-critic | Discovery | Checks coverage, gaps, journal quality, recency | librarian |
| explorer | Discovery | Ranked data sources, feasibility grades | explorer-critic |
| explorer-critic | Discovery | Checks measurement validity, sample selection, ID fit | explorer |
| strategist | Strategy | Strategy memo, pseudo-code, robustness plan | strategist-critic |
| strategist-critic | Strategy | Checks assumptions, inference, robustness (4 paper types) | strategist |
| theorist | Strategy | Assumptions, theorems, proofs, notation glossary | theorist-critic |
| theorist-critic | Strategy | 4-phase proof review with early-stop on critical gaps | theorist |
| data-engineer | Execution | Clean datasets, codebooks, publication figures | coder-critic |
| coder | Execution | Analysis scripts, tables, figures, results summary | coder-critic |
| coder-critic | Execution | 16 checks: strategy alignment, code quality, numerics | coder, data-engineer |
| writer | Execution | Paper manuscript, claim-source map | writer-critic |
| writer-critic | Execution | 8 checks: structure, claims, format, voice, notation | writer |
| editor | Peer Review | Desk review, referee dispatch, editorial decision | None |
| domain-referee | Peer Review | Blind review (substance: contribution, literature, scope) | None |
| methods-referee | Peer Review | Blind review (methods: ID, estimation, inference, robustness) | None |
| storyteller | Presentation | Beamer / Quarto talks in 4 formats | storyteller-critic |
| storyteller-critic | Presentation | Checks narrative, visuals, content fidelity, scope | storyteller |
| orchestrator | All | Agent dispatch, quality gates, pipeline state | None |
| verifier | Submission | Compilation, execution, file integrity, AEA audit | None |

Where the Details Live

This page teaches the architecture — the four rules, the skill-centric model, the group relationships, and the phase flow. The operational specifics (rubrics, scoring tables, checklists, deduction schedules) live elsewhere:

  • Agent identity files: .claude/agents/*.md — role, voice, constraints, paper-type awareness
  • Skill templates: {skill}/templates/ — checklists, protocols, report formats
  • Skill config: {skill}/config/ — scoring rubrics, tolerance thresholds
  • Permission registry: .claude/rules/permissions.md — REQUIRES, PRODUCES, escalation targets, quality weights
  • Lifecycle validation: .claude/rules/lifecycle.md — pre-dispatch and post-completion checks

The permission registry is the single source of truth for agent dispatch. Adding a new agent means adding one entry there and one file in .claude/agents/. No other file needs to change.