Rules & Invariants
What the system enforces and why
How Rules Work
The system enforces quality through rule files — machine-readable documents that tell each agent what to check, what to flag, and how severely to penalize violations. Every critic reads the relevant rules before reviewing. Every worker reads them before creating.
This page is a reference for all 11 rule files, the 22 content invariants, the severity gradient, and the quality gates. If you fork the repo and change nothing else, these rules still apply.
Rule Files Overview
Eleven files in .claude/rules/ govern the system. Each file has a clear scope — no rule appears in two places.
| File | Purpose | Primary readers |
|---|---|---|
| agents.md | Worker-critic pairs, separation of powers, 3-strike escalation | Orchestrator, all agents |
| content-invariants.md | 22 non-negotiable checks (paper, code, talk, traceability) | All critics, verifier, lint hook |
| content-standards.md | Table format, figure format, PDF processing, exploration protocol | Coder, writer, coder-critic, writer-critic |
| lifecycle.md | Pre-dispatch and post-completion validation protocol | Orchestrator |
| logging.md | Session report, research journal, and pipeline state formats | Orchestrator, all agents |
| meta-governance.md | Dual-nature rule (template vs. working project), learning promotion | Orchestrator, user |
| permissions.md | Agent registry: phase, dependencies, outputs, escalation targets, quality weights | Orchestrator |
| quality.md | Scoring protocol, gate thresholds, severity gradient, deduction scaling | Orchestrator, all critics |
| revision.md | R&R cycle: comment classification and routing | Orchestrator, writer, coder |
| workflow.md | Plan-first protocol, orchestrator loop, dependency graph, context management | Orchestrator, user |
| working-paper-format.md | LaTeX preamble standard, title page format, bibliography setup | Writer, writer-critic, verifier |
What each file governs
agents.md defines the adversarial pairing system. Every worker has a paired critic. Critics never create artifacts; creators never self-score. When a pair fails to converge after 3 rounds, the file specifies where to escalate (often back to the user for fundamental disagreements).
content-invariants.md lists the 22 non-negotiable rules that every agent checks. These are the “constitution” of the system: violations produce deductions, not suggestions. Critics cite invariant numbers (e.g., “violates INV-3”) in their reports. See the Content Invariants section for the full table.
content-standards.md specifies how tables, figures, and PDFs should look. It covers booktabs formatting, coefficient display conventions (including AEA no-stars policy), preferred R packages (modelsummary, fixest::etable), figure typography (serif fonts, no in-plot titles), and the exploration folder lifecycle.
lifecycle.md defines PRE-dispatch and POST-completion validation. Before dispatching any agent, the Orchestrator checks that required input artifacts exist and contain the right sections. After completion, it verifies outputs were created and the critic score was recorded. This prevents launching agents with missing inputs or advancing past agents with missing outputs.
logging.md standardizes three persistence mechanisms: the session report (append-only human-readable log), the research journal (structured agent-by-agent entries with scores), and the pipeline state JSON (machine-readable status for session recovery).
meta-governance.md addresses the repo’s dual nature as both a working project and a public template. The “one rule” is simple: before committing, ask whether another researcher forking the repo would benefit. Project-specific state goes in .claude/state/; reusable patterns go in the repo itself. It also governs learning promotion — patterns validated across 3+ projects can be elevated to new invariants, but only with user approval.
permissions.md is the agent registry. Each agent’s entry declares its phase, parallel group, required inputs, expected outputs, paired critic, escalation target, and quality weight. The Orchestrator reads this file to determine what to dispatch and when. Adding a new agent means adding an entry here — no other file needs to change.
quality.md defines how individual critic scores aggregate into the project-wide score that gates submission. It also specifies the severity gradient, under which the same violation costs more in later phases. See the Severity Gradient and Quality Gates sections for details.
revision.md handles the R&R cycle. When real referee reports arrive, /revise classifies each comment as NEW ANALYSIS (routed to coder), CLARIFICATION (routed to writer), DISAGREE (flagged for user), or MINOR (writer handles directly). The system never autonomously pushes back on referees — DISAGREE items always require human judgment.
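The routing described above can be sketched as a small lookup. This is a minimal illustration, not the actual /revise implementation; the names `CommentClass`, `ROUTES`, `route`, and `needs_human` are hypothetical.

```python
from enum import Enum


class CommentClass(Enum):
    """The four comment classes named in revision.md."""
    NEW_ANALYSIS = "NEW ANALYSIS"
    CLARIFICATION = "CLARIFICATION"
    DISAGREE = "DISAGREE"
    MINOR = "MINOR"


# Routing targets as described in the text; "user" means the item is
# flagged and waits for human judgment (hypothetical label).
ROUTES = {
    CommentClass.NEW_ANALYSIS: "coder",
    CommentClass.CLARIFICATION: "writer",
    CommentClass.DISAGREE: "user",
    CommentClass.MINOR: "writer",
}


def route(comment_class: CommentClass) -> str:
    """Return who handles a referee comment of the given class."""
    return ROUTES[comment_class]


def needs_human(comment_class: CommentClass) -> bool:
    """DISAGREE items are never resolved autonomously."""
    return ROUTES[comment_class] == "user"
```

Keeping the routing in one table makes the "never push back autonomously" rule easy to audit: only one class maps to the user.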
workflow.md covers the plan-first protocol (plan before coding, save plans to disk, get user approval), the orchestrator’s dependency-driven loop (identify, dispatch, review, verify, score), standalone vs. pipeline mode, and context management (compaction discipline, session recovery).
working-paper-format.md specifies the LaTeX preamble, title page conventions, abstract format, section ordering, and bibliography setup. It distinguishes required items (blocking deductions if violated) from recommended items (advisory, no penalty). The writer-critic checks every paper against this standard.
Content Invariants
These are non-negotiable. Violations produce deductions, not suggestions. Critics cite invariant numbers (e.g., “violates INV-7”) in their reports.
Full Invariant Table
| # | Rule | Category | Enforced by | Typical deduction |
|---|---|---|---|---|
| INV-1 | Every table has notes (variables, sample, source) | Paper | writer-critic | -5 |
| INV-2 | Every figure has a caption with note (what, how to read, source) | Paper | writer-critic | -5 |
| INV-3 | No \hline; use booktabs (\toprule, \midrule, \bottomrule). No vertical rules | Paper | writer-critic | -3 |
| INV-4 | Significance stars follow journal profile (AEA: no stars) | Paper | writer-critic | -3 |
| INV-5 | Abstract is 150 words or fewer | Paper | writer-critic | -5 |
| INV-6 | JEL codes and keywords present after abstract | Paper | writer-critic | -5 |
| INV-7 | Notation consistent across all sections | Paper | writer-critic | -5 |
| INV-8 | Every causal claim has a corresponding identification section | Paper | writer-critic | -10 |
| INV-9 | biblatex + biber, not natbib + bibtex | Paper | writer-critic, verifier | -3 |
| INV-10 | hyperref loaded second-to-last; cleveref loaded after it | Paper | writer-critic, verifier | -2 |
| INV-11 | Numbers in text match tables/figures exactly | Paper | writer-critic | -5 |
| INV-12 | No titles inside figures; titles go in LaTeX \caption{} | Paper | writer-critic | -3 |
| INV-13 | Scripts export bare tabular: no \begin{table}, \caption, or notes | Paper / Code | writer-critic, coder-critic | -3 |
| INV-14 | set.seed() called exactly once, at top, if stochastic | Code | coder-critic, verifier | -5 |
| INV-15 | All packages loaded at top, before any computation | Code | coder-critic, verifier | -3 |
| INV-16 | No absolute paths; use here() / pathlib.Path / joinpath(@__DIR__) | Code | coder-critic, verifier | -5 |
| INV-17 | No growing vectors in loops; pre-allocate or vectorize | Code | coder-critic | -3 |
| INV-18 | Output files go to the path specified in CLAUDE.md | Code | coder-critic | -3 |
| INV-19 | No prohibited functions (setwd(), rm(list=ls()), install.packages(), attach()) | Code | coder-critic, verifier | -5 |
| INV-20 | Talk notation matches paper exactly | Talk | storyteller-critic | -5 |
| INV-21 | Every claim on a slide is traceable to the paper | Talk | storyteller-critic | -5 |
| INV-22 | Every numerical claim has an entry in the claim-source map | Traceability | writer-critic | -5 |
Invariants by Category
Paper (INV-1 through INV-13)
Most paper invariants enforce standard economics formatting that journals expect. A few deserve explanation:
INV-4 (significance stars) adapts to the target journal. AEA journals prohibit significance stars entirely — report standard errors and confidence intervals instead. Working papers default to the * p < 0.10, ** p < 0.05, *** p < 0.01 convention. The journal profile determines which convention applies.
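As a rough sketch of how a journal profile could switch the star convention, the function below applies the working-paper thresholds quoted above and returns no stars for AEA. The function name and the profile strings `"AEA"` and `"working-paper"` are assumptions for illustration, not identifiers from the rule files.

```python
def significance_stars(p_value: float, journal_profile: str) -> str:
    """Return the star suffix for a coefficient under a journal profile.

    Profile names are hypothetical; only the conventions come from INV-4.
    """
    if journal_profile == "AEA":
        return ""  # AEA journals: no stars, report SEs and CIs instead
    # Working-paper default: * p < 0.10, ** p < 0.05, *** p < 0.01
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.10:
        return "*"
    return ""
```

For example, `significance_stars(0.03, "working-paper")` gives `"**"`, while the same p-value under `"AEA"` gives an empty string.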
INV-7 (notation consistency) catches a common drafting problem: using \(\beta\) for the treatment effect in Section 3 but \(\delta\) in Section 6, or using \(i\) for individuals in one place and firms in another. The same symbol must mean the same thing everywhere.
INV-8 (causal claims need identification) prevents the most common referee complaint in empirical economics. If you write “X causes Y,” you need an identification section explaining how you separate causation from correlation. Descriptive papers should not use causal language at all.
INV-11 (numbers match) catches stale values — when you re-run the analysis and get 0.045 but the text still says 0.052 from a previous run. The writer-critic cross-references every number in the text against the tables and figures.
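A stale-number check of this kind can be approximated with a few lines of string matching. The sketch below is a simplification under stated assumptions: it only looks at decimal numbers, it compares them as strings (so 0.05 and 0.050 count as different), and `stale_numbers` is a hypothetical helper rather than what the writer-critic actually runs.

```python
import re

# Decimal numbers such as 0.045 or -1.20; integers are ignored in this sketch.
NUMBER = re.compile(r"-?\d+\.\d+")


def stale_numbers(text: str, table_values: set[str]) -> list[str]:
    """Return decimal numbers quoted in the text that appear in no table."""
    return [n for n in NUMBER.findall(text) if n not in table_values]


# The text still quotes 0.052 from a previous run; the table now says 0.045.
flagged = stale_numbers("The effect is 0.052 (SE 0.011).", {"0.045", "0.011"})
```

Here `flagged` contains only `"0.052"`, the value that no longer appears in any table.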
INV-13 (bare tabular export) enforces the division of labor between code and paper. R/Python/Julia scripts produce the raw table content. The main.tex file wraps it with \begin{table}, \caption{}, and notes. This means you can change a caption without re-running code, and code changes automatically flow into the paper.
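On the code side, a bare export might look like the following sketch. It assumes a plain list-of-strings table and a hypothetical helper named `bare_tabular`; real projects would more likely use a table package, but the output shape is the point: a tabular with booktabs rules and nothing else.

```python
def bare_tabular(header: list[str], rows: list[list[str]]) -> str:
    """Render a booktabs tabular with no table float, caption, or notes."""
    # First column left-aligned, remaining columns centered (an assumption).
    col_spec = "l" + "c" * (len(header) - 1)
    lines = [rf"\begin{{tabular}}{{{col_spec}}}", r"\toprule"]
    lines.append(" & ".join(header) + r" \\")
    lines.append(r"\midrule")
    lines.extend(" & ".join(row) + r" \\" for row in rows)
    lines += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(lines) + "\n"


out = bare_tabular(["", "(1)"], [["Treatment", "0.045"], ["", "(0.011)"]])
```

The paper then wraps the exported file in its own table environment, so captions and notes live in main.tex and never in the script output.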
Code (INV-14 through INV-19)
Code invariants enforce reproducibility and portability:
INV-14 (single seed) ensures stochastic results are reproducible. One set.seed() call at the top of the master script, not scattered throughout. Multiple seeds make debugging non-deterministic failures nearly impossible.
INV-16 (no absolute paths) is the single most common portability failure. Code that works on your machine but breaks on a collaborator’s because of /Users/yourname/... paths is not reproducible code. Use here::here() in R, pathlib.Path in Python, or joinpath(@__DIR__, ...) in Julia.
INV-19 (prohibited functions) bans functions that break reproducibility or are dangerous in production. setwd() changes global state and breaks when paths differ across machines. rm(list = ls()) destroys the workspace and breaks interactive debugging. install.packages() in scripts is a side effect that should not run automatically. attach() puts a data frame's columns on the search path, which makes it ambiguous which object a name refers to.
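Several of the code invariants are mechanical enough to check with pattern matching. The sketch below shows one way a lint pass over an R script could flag INV-14, INV-16, and INV-19. It is an assumption-laden illustration, not the repo's lint hook: the function name `lint_r_script` and the regular expressions are hypothetical, comments are not stripped, and the absolute-path pattern is a heuristic.

```python
import re

# Prohibited calls from INV-19, keyed by the name used in the warning.
PROHIBITED = {
    "setwd()": re.compile(r"\bsetwd\s*\("),
    "rm(list=ls())": re.compile(r"\brm\s*\(\s*list\s*=\s*ls\s*\(\s*\)"),
    "install.packages()": re.compile(r"\binstall\.packages\s*\("),
    "attach()": re.compile(r"\battach\s*\("),
}
SEED = re.compile(r"\bset\.seed\s*\(")
# Heuristic: a string literal starting with /x, ~/ or a drive letter.
ABSOLUTE_PATH = re.compile(r"""["'](?:/\w|~/|[A-Za-z]:[\\/])""")


def lint_r_script(source: str) -> list[str]:
    """Return advisory warnings for one R script's source text."""
    warnings = []
    if len(SEED.findall(source)) > 1:
        warnings.append("INV-14: set.seed() called more than once")
    if ABSOLUTE_PATH.search(source):
        warnings.append("INV-16: absolute path in string literal")
    for name, pattern in PROHIBITED.items():
        if pattern.search(source):
            warnings.append(f"INV-19: prohibited function {name}")
    return warnings
```

A check like this only catches the obvious cases, which is consistent with the lint hook being advisory while the critic and verifier make the binding call.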
Talk (INV-20 through INV-21)
Talk invariants enforce fidelity between the presentation and the paper:
INV-20 (notation matches) prevents the common problem of simplifying notation for slides and introducing inconsistencies. If the paper uses \(\hat{\beta}_{ATT}\), the slides must use the same symbol — not \(\beta\) or \(\hat{\tau}\).
INV-21 (traceable claims) prevents “orphan results” that appear on slides but not in the paper. Every number, every claim, every figure on a slide must trace back to the manuscript.
Traceability (INV-22)
INV-22 (claim-source map) is the audit trail. Every numerical claim in the manuscript — “treatment effect of 4.5 percentage points” — must have an entry in quality_reports/claim_source_map_{project}.md that traces it to a specific script line and output file. This makes it possible for a referee (or a future you) to verify any number in the paper by following the chain from text to table to code.
Severity Gradient
Critics do not apply the same harshness at every stage. A missing citation in an early brainstorm is a gentle nudge; the same gap in a near-final manuscript is a serious deduction. The Orchestrator tells each critic which severity level to use.
Phase-Based Severity
| Phase | Critic stance | Rationale |
|---|---|---|
| Discovery | Encouraging (low) | Early ideas need space to develop. Over-criticizing kills exploration. |
| Strategy | Constructive (medium) | Identification must be sound, but the critic should suggest alternatives, not just reject. |
| Execution | Strict (high) | Code and paper are near-final. Bugs found here are expensive to fix later. |
| Peer Review | Adversarial (maximum) | Simulates real referees. The goal is to find every weakness before a journal does. |
| Presentation | Professional (medium-high) | Talks should be polished, but scores are advisory — they do not block submission. |
The Orchestrator includes the severity level explicitly in the critic’s prompt:
You are reviewing at SEVERITY: HIGH (Execution phase).
Flag all issues. Do not suggest "consider" --- state what must change.
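One way to picture how the phase table turns into that prompt line is a lookup keyed by phase. Only the HIGH / Execution wording is taken from the example above; the other level strings and the names `SEVERITY_BY_PHASE` and `severity_header` are assumptions.

```python
# Severity labels per phase, following the Phase-Based Severity table.
# Only "HIGH" is confirmed by the example prompt; the rest are assumed spellings.
SEVERITY_BY_PHASE = {
    "Discovery": "LOW",
    "Strategy": "MEDIUM",
    "Execution": "HIGH",
    "Peer Review": "MAXIMUM",
    "Presentation": "MEDIUM-HIGH",
}


def severity_header(phase: str) -> str:
    """Build the first line of a critic prompt for the given phase."""
    level = SEVERITY_BY_PHASE[phase]
    return f"You are reviewing at SEVERITY: {level} ({phase} phase)."
```

An unknown phase raises `KeyError`, which is arguably the right failure mode: a critic should not be dispatched without an explicit severity.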
Deduction Scaling
The same violation costs more in later phases. This table shows how four representative issues scale:
| Issue | Discovery | Strategy | Execution | Peer Review |
|---|---|---|---|---|
| Missing citation | -2 | -5 | -10 | -15 |
| Notation inconsistency | -1 | -3 | -5 | -5 |
| Hedging language | — | — | -3 | -5 |
| Missing robustness check | — | -5 | -15 | -20 |
The principle: early phases are about getting the direction right. Late phases are about getting the details right. A notation inconsistency in Discovery is a reminder; the same inconsistency in Execution means the writer is not reading their own paper carefully enough.
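The scaling table can be read as a lookup from issue and phase to points deducted. The sketch below encodes the four representative rows and combines them with the start-at-100 scoring this page describes. Treating a dash as "no deduction" and flooring the score at 0 are assumptions, and the names are hypothetical.

```python
PHASES = ("Discovery", "Strategy", "Execution", "Peer Review")

# Points deducted per phase; None stands for the dash in the table,
# read here as "not penalized in this phase" (an assumption).
DEDUCTIONS = {
    "Missing citation": (2, 5, 10, 15),
    "Notation inconsistency": (1, 3, 5, 5),
    "Hedging language": (None, None, 3, 5),
    "Missing robustness check": (None, 5, 15, 20),
}


def deduction(issue: str, phase: str) -> int:
    """Points to deduct for one issue in one phase."""
    points = DEDUCTIONS[issue][PHASES.index(phase)]
    return points or 0


def critic_score(issues: list[str], phase: str) -> int:
    """Start at 100 and deduct for each issue found, never going below 0."""
    return max(0, 100 - sum(deduction(i, phase) for i in issues))
```

With a missing citation and hedging language, the same draft scores 98 in Discovery but 87 in Execution.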
The simulated referees are intentionally harsh because real referees are harsh. A system that scores a paper 95/100 only for a journal to reject it has failed. The adversarial stance exists to catch problems before submission, not after. If the paper survives the system’s referees, it is better prepared for the real ones.
Quality Gates
Quality scores determine what actions are allowed. The system uses a weighted aggregate of individual critic scores, with hard thresholds at each gate.
The Four Gates
| Gate | Overall score | Per-component minimum | What it means |
|---|---|---|---|
| Commit | >= 80 | None | Work can be committed to the repository |
| PR | >= 90 | None | Work can be opened as a pull request |
| Submission | >= 95 | >= 80 per component | Work can be submitted to a journal |
| Blocked | < 80 | — | Work cannot be committed; issues must be fixed first |
How Scores Aggregate
Each critic produces a score from 0 to 100 by starting at 100 and deducting for issues found. These individual scores combine via a weighted average:
| Component | Weight | Source |
|---|---|---|
| Literature coverage | 10% | librarian-critic |
| Data quality | 10% | explorer-critic |
| Identification validity | 25% | strategist-critic |
| Theory (when present) | 20% | theorist-critic |
| Code quality | 15% | coder-critic |
| Paper quality | 25% | Avg(domain-referee + methods-referee) |
| Manuscript polish | 10% | writer-critic |
| Replication readiness | 5% | Verifier (pass = 100, fail = 0) |
Theory weight applies only to papers with a formal theory section (econometric methods, theory + empirics, structural identification). For applied papers using off-the-shelf estimators, the theory row is excluded and remaining weights renormalize. As listed, the weights sum to 120% with the theory row and to 100% without it.
When Components Are Missing
Not every project has every component. If a component has not been scored:
- It is excluded from the weighted average
- Remaining weights are renormalized proportionally
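- Example (applied paper, no theory row): with no literature review yet, the remaining weights become 11% (data), 28% (identification), 17% (code), 28% (paper), 11% (polish), and 6% (replication), rounded to whole percents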
This means you can run partial pipelines (e.g., just Strategy + Execution) and still get meaningful scores.
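The renormalization above is a weighted average taken over only the components that have scores. Here is a minimal sketch, assuming the weights from the table and hypothetical short component names; it is not the Orchestrator's actual code.

```python
# Weights from the aggregation table, in percent (short keys are hypothetical).
WEIGHTS = {
    "literature": 10,
    "data": 10,
    "identification": 25,
    "theory": 20,
    "code": 15,
    "paper": 25,
    "polish": 10,
    "replication": 5,
}


def renormalized_weights(scored: list[str]) -> dict[str, float]:
    """Rescale the weights of the scored components so they sum to 1."""
    if not scored:
        raise ValueError("no components have been scored")
    total = sum(WEIGHTS[name] for name in scored)
    return {name: WEIGHTS[name] / total for name in scored}


def aggregate(scores: dict[str, float]) -> float:
    """Weighted average over the components present in `scores`."""
    weights = renormalized_weights(list(scores))
    return sum(weights[name] * score for name, score in scores.items())


# Applied paper (no theory row) with no literature review yet.
example = renormalized_weights(
    ["data", "identification", "code", "paper", "polish", "replication"]
)
percents = [round(100 * w) for w in example.values()]
```

`percents` reproduces the 11%, 28%, 17%, 28%, 11%, 6% example, and a partial run such as `aggregate({"identification": 90, "code": 80})` gives 86.25.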
The Per-Component Minimum
For submission, the aggregate score must be >= 95 and no individual component can be below 80. A perfect literature review cannot compensate for broken identification. A beautiful manuscript cannot compensate for code that does not reproduce.
This rule prevents the “average-up” problem — where excellence in one area masks serious deficiencies in another.
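The gate logic, including the per-component floor, fits in a few comparisons. The sketch below uses the thresholds from the gate table; the function name `gates_passed` and the component keys are hypothetical.

```python
def gates_passed(overall: float, components: dict[str, float]) -> list[str]:
    """Return the gates the work clears; an empty list means Blocked."""
    passed = []
    if overall >= 80:
        passed.append("Commit")
    if overall >= 90:
        passed.append("PR")
    # Submission needs a high aggregate AND every component at 80 or above.
    if overall >= 95 and all(score >= 80 for score in components.values()):
        passed.append("Submission")
    return passed


# A 96 aggregate does not unlock submission if identification sits at 75.
result = gates_passed(96, {"identification": 75, "literature": 100})
```

In this example `result` is `["Commit", "PR"]`: the aggregate clears 95, but the component floor holds submission back.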
Recovering from a Low Score
When a score falls below the gate threshold:
- The Orchestrator identifies which components are blocking
- It re-dispatches the relevant worker-critic pair to fix the flagged issues
- After fixes, the critic re-reviews (up to 3 rounds per pair)
- If 3 rounds fail, the system escalates — often back to the user for a design decision
The goal is convergence, not perfection on the first pass. Most artifacts reach 80+ by round 2.
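The recovery steps above amount to a bounded loop with an escalation exit. The following sketch models the worker and critic as plain callables, which is an assumption made for illustration; `converge` and its return values are hypothetical names.

```python
from typing import Callable

MAX_ROUNDS = 3  # the 3-round limit per worker-critic pair


def converge(
    work: Callable[[list[str]], str],
    review: Callable[[str], tuple[float, list[str]]],
    threshold: float = 80.0,
) -> tuple[str, float, int]:
    """Run worker-critic rounds until the score clears the threshold.

    Returns (status, last_score, rounds_used), where status is
    "passed" or "escalate".
    """
    issues: list[str] = []
    score = 0.0
    for round_number in range(1, MAX_ROUNDS + 1):
        artifact = work(issues)           # worker addresses flagged issues
        score, issues = review(artifact)  # critic scores; it never edits
        if score >= threshold:
            return "passed", score, round_number
    return "escalate", score, MAX_ROUNDS
```

Keeping the critic's return value limited to a score and a list of issues mirrors the separation of powers: the critic reports, the worker changes the artifact.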
A score of 95 is difficult to achieve. It requires all components to be polished, all robustness checks to pass, and the simulated referees to find no fatal issues. This is by design — the system should flag problems before a journal does, not after. If the gate feels too strict, the alternative is a rejected paper.
Working Paper Format
The working-paper-format.md rule file specifies the LaTeX standard that all papers must follow. The full reference preamble, title page conventions, and bibliography setup are documented there. Here is a summary of what the writer-critic enforces.
Required (Blocking Deductions)
| Violation | Deduction |
|---|---|
| Wrong document class or font size | -5 |
| Missing \doublespacing in body | -5 |
| Using natbib instead of biblatex | -3 |
| Missing fancyhdr page number setup | -2 |
| \textbf{} wrapping \title{} | -3 |
| \and between authors instead of \quad | -3 |
| Repeated affiliation text outside \thanks{} | -3 |
| Missing JEL codes or keywords | -5 |
| \hline instead of booktabs rules | -3 |
| Missing table notes | -5 |
| Missing figure notes | -5 |
| hyperref not loaded second-to-last | -2 |
| Missing cleveref after hyperref | -2 |
| Manual Figure~\ref{} instead of \cref{} | -1 per occurrence (max -5) |
| Using bibtex instead of biber | -3 |
| Missing microtype | -2 |
Recommended (Advisory Only)
These are reported but do not produce deductions:
- Missing lmodern font
- Non-default citation color
- Missing captionsetup
- Missing hidelinks in hyperref
Agent Enforcement Summary
Different agents check different subsets of the rules. This table shows which invariants each enforcer is responsible for and what happens when it finds a violation.
| Agent | Checks | Action on violation |
|---|---|---|
| writer-critic | INV-1 through INV-13, INV-22 | Deduct per scoring rubric |
| coder-critic | INV-13 through INV-19 | Deduct per scoring rubric |
| storyteller-critic | INV-20, INV-21 | Deduct per scoring rubric |
| verifier | INV-9, INV-10, INV-14, INV-15, INV-16, INV-19 | FAIL (binary — no partial credit) |
| lint hook | INV-14, INV-15, INV-16, INV-19 | Advisory warning (non-blocking) |
Note the overlap: INV-13 is checked by both the writer-critic (paper side) and coder-critic (code side). INV-9 and INV-10 are checked by both the writer-critic (with deductions) and the verifier (with a hard FAIL); INV-14–16 and INV-19 are checked by both the coder-critic (with deductions) and the verifier (with a hard FAIL). This redundancy is intentional: the critic catches issues during development, and the verifier catches anything that slipped through at submission time.
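The enforcement table can be expressed as sets, which makes the overlaps easy to query. This is an illustrative encoding of the table rows only; the helper names `inv` and `enforcers` are hypothetical.

```python
def inv(*numbers: int) -> set[str]:
    """Build a set of invariant IDs such as {"INV-1", "INV-2"}."""
    return {f"INV-{n}" for n in numbers}


# One entry per row of the Agent Enforcement Summary table.
CHECKS = {
    "writer-critic": inv(*range(1, 14), 22),
    "coder-critic": inv(*range(13, 20)),
    "storyteller-critic": inv(20, 21),
    "verifier": inv(9, 10, 14, 15, 16, 19),
    "lint hook": inv(14, 15, 16, 19),
}


def enforcers(invariant: str) -> list[str]:
    """List every agent that checks the given invariant, alphabetically."""
    return sorted(agent for agent, checked in CHECKS.items() if invariant in checked)
```

For instance, `enforcers("INV-13")` returns both the coder-critic and the writer-critic, and taking the union of all sets confirms that every invariant from INV-1 to INV-22 has at least one enforcer.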
Lifecycle Validation
The Orchestrator runs two validation checks around every agent dispatch, defined in lifecycle.md.
PRE-dispatch: Before launching any agent, the Orchestrator reads the agent’s entry in permissions.md and verifies that all required input artifacts exist and contain the expected sections. If validation fails, the agent is not dispatched — instead, the system reports what is missing and suggests which skill to run first.
POST-completion: After an agent finishes, the Orchestrator verifies that the expected output artifacts were created, that required sections are present in those outputs, and that the paired critic has produced a scored report logged in the research journal. If validation fails, the agent is re-dispatched with the specific gaps noted.
The principle is fail fast: never launch an agent with missing inputs and hope it works, and never advance past an agent with missing outputs. Every handoff between agents is validated.
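A PRE-dispatch check of this kind could be sketched as follows. The shape of the `required` mapping (artifact path to required section headings) and the file names in the usage example are assumptions; the real entries live in permissions.md and may be structured differently.

```python
from pathlib import Path


def missing_inputs(required: dict[str, list[str]], root: Path) -> list[str]:
    """Return human-readable gaps; an empty list means dispatch may proceed.

    `required` maps a path relative to `root` to the section headings
    that the artifact must contain (hypothetical format).
    """
    gaps = []
    for relative_path, sections in required.items():
        artifact = root / relative_path
        if not artifact.is_file():
            gaps.append(f"missing artifact: {relative_path}")
            continue
        text = artifact.read_text(encoding="utf-8")
        for section in sections:
            if section not in text:
                gaps.append(f"{relative_path}: missing section '{section}'")
    return gaps
```

Calling it with something like `{"strategy_memo.md": ["## Identification"]}` and the project root returns the list of gaps to report instead of dispatching. The same function shape would serve the POST-completion check by pointing it at expected outputs rather than inputs.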