Rules & Invariants

What the system enforces and why

How Rules Work

The system enforces quality through rule files — machine-readable documents that tell each agent what to check, what to flag, and how severely to penalize violations. Every critic reads the relevant rules before reviewing. Every worker reads them before creating.

This page is a reference for all 11 rule files, the 22 content invariants, the severity gradient, and the quality gates. If you fork the repo and change nothing else, these rules still apply.


Rule Files Overview

Eleven files in .claude/rules/ govern the system. Each file has a clear scope — no rule appears in two places.

| File | Purpose | Primary readers |
| --- | --- | --- |
| agents.md | Worker-critic pairs, separation of powers, 3-strike escalation | Orchestrator, all agents |
| content-invariants.md | 22 non-negotiable checks (paper, code, talk, traceability) | All critics, verifier, lint hook |
| content-standards.md | Table format, figure format, PDF processing, exploration protocol | Coder, writer, coder-critic, writer-critic |
| lifecycle.md | Pre-dispatch and post-completion validation protocol | Orchestrator |
| logging.md | Session report, research journal, and pipeline state formats | Orchestrator, all agents |
| meta-governance.md | Dual-nature rule (template vs. working project), learning promotion | Orchestrator, user |
| permissions.md | Agent registry: phase, dependencies, outputs, escalation targets, quality weights | Orchestrator |
| quality.md | Scoring protocol, gate thresholds, severity gradient, deduction scaling | Orchestrator, all critics |
| revision.md | R&R cycle: comment classification and routing | Orchestrator, writer, coder |
| workflow.md | Plan-first protocol, orchestrator loop, dependency graph, context management | Orchestrator, user |
| working-paper-format.md | LaTeX preamble standard, title page format, bibliography setup | Writer, writer-critic, verifier |

What each file governs

agents.md defines the adversarial pairing system. Every worker has a paired critic. Critics never create artifacts; creators never self-score. When a pair fails to converge after 3 rounds, the file specifies where to escalate (often back to the user for fundamental disagreements).

content-invariants.md lists the 22 non-negotiable rules that every agent checks. These are the “constitution” of the system: violations produce deductions, not suggestions. Critics cite invariant numbers (e.g., “violates INV-3”) in their reports. See the Content Invariants section below for the full table.

content-standards.md specifies how tables, figures, and PDFs should look. It covers booktabs formatting, coefficient display conventions (including AEA no-stars policy), preferred R packages (modelsummary, fixest::etable), figure typography (serif fonts, no in-plot titles), and the exploration folder lifecycle.

lifecycle.md defines PRE-dispatch and POST-completion validation. Before dispatching any agent, the Orchestrator checks that required input artifacts exist and contain the right sections. After completion, it verifies outputs were created and the critic score was recorded. This prevents launching agents with missing inputs or advancing past agents with missing outputs.

logging.md standardizes three persistence mechanisms: the session report (append-only human-readable log), the research journal (structured agent-by-agent entries with scores), and the pipeline state JSON (machine-readable status for session recovery).

meta-governance.md addresses the repo’s dual nature as both a working project and a public template. The “one rule” is simple: before committing, ask whether another researcher forking the repo would benefit. Project-specific state goes in .claude/state/; reusable patterns go in the repo itself. It also governs learning promotion — patterns validated across 3+ projects can be elevated to new invariants, but only with user approval.

permissions.md is the agent registry. Each agent’s entry declares its phase, parallel group, required inputs, expected outputs, paired critic, escalation target, and quality weight. The Orchestrator reads this file to determine what to dispatch and when. Adding a new agent means adding an entry here — no other file needs to change.
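As a rough illustration, a registry entry with the fields listed above could be modeled like this. The class name, field names, and every sample value (including the artifact names) are assumptions made for this sketch; they are not the actual format of permissions.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentEntry:
    """Hypothetical in-memory view of one registry entry."""
    name: str
    phase: str
    parallel_group: str
    required_inputs: tuple[str, ...]
    expected_outputs: tuple[str, ...]
    paired_critic: str
    escalation_target: str
    quality_weight: float  # share of the aggregate score, e.g. 0.15


def ready_to_dispatch(entry: AgentEntry, available: set[str]) -> bool:
    """An agent is dispatchable once every required input is available."""
    return set(entry.required_inputs) <= available


# Illustrative values only; the real registry may differ.
coder = AgentEntry(
    name="coder",
    phase="Execution",
    parallel_group="A",
    required_inputs=("strategy_memo.md",),
    expected_outputs=("scripts/", "tables/"),
    paired_critic="coder-critic",
    escalation_target="user",
    quality_weight=0.15,
)
```

Keeping everything the Orchestrator needs in one record is what makes "adding an agent means adding an entry" plausible: the dispatch check only reads the entry, never agent-specific code.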

quality.md defines how individual critic scores aggregate into the project-wide score that gates submission. It also specifies the severity gradient: the same violation costs more in later phases. See the Severity Gradient and Quality Gates sections below for details.

revision.md handles the R&R cycle. When real referee reports arrive, /revise classifies each comment as NEW ANALYSIS (routed to coder), CLARIFICATION (routed to writer), DISAGREE (flagged for user), or MINOR (writer handles directly). The system never autonomously pushes back on referees — DISAGREE items always require human judgment.
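The routing described above amounts to a small lookup table. The following is a minimal sketch; the function names are hypothetical, and the real /revise skill may classify and route comments differently.

```python
# Comment class -> who handles it, as described in the paragraph above.
ROUTES = {
    "NEW ANALYSIS": "coder",
    "CLARIFICATION": "writer",
    "DISAGREE": "user",
    "MINOR": "writer",
}


def route_comment(label: str) -> str:
    """Return the handler for a classified referee comment."""
    key = label.strip().upper()
    if key not in ROUTES:
        raise ValueError(f"unknown comment class: {label!r}")
    return ROUTES[key]


def needs_human(label: str) -> bool:
    """DISAGREE items are never resolved autonomously."""
    return route_comment(label) == "user"
```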

workflow.md covers the plan-first protocol (plan before coding, save plans to disk, get user approval), the orchestrator’s dependency-driven loop (identify, dispatch, review, verify, score), standalone vs. pipeline mode, and context management (compaction discipline, session recovery).

working-paper-format.md specifies the LaTeX preamble, title page conventions, abstract format, section ordering, and bibliography setup. It distinguishes required items (blocking deductions if violated) from recommended items (advisory, no penalty). The writer-critic checks every paper against this standard.


Content Invariants

These are non-negotiable. Violations produce deductions, not suggestions. Critics cite invariant numbers (e.g., “violates INV-7”) in their reports.

Full Invariant Table

| # | Rule | Category | Enforced by | Typical deduction |
| --- | --- | --- | --- | --- |
| INV-1 | Every table has notes (variables, sample, source) | Paper | writer-critic | -5 |
| INV-2 | Every figure has a caption with note (what, how to read, source) | Paper | writer-critic | -5 |
| INV-3 | No \hline; use booktabs (\toprule, \midrule, \bottomrule). No vertical rules | Paper | writer-critic | -3 |
| INV-4 | Significance stars follow journal profile (AEA: no stars) | Paper | writer-critic | -3 |
| INV-5 | Abstract is 150 words or fewer | Paper | writer-critic | -5 |
| INV-6 | JEL codes and keywords present after abstract | Paper | writer-critic | -5 |
| INV-7 | Notation consistent across all sections | Paper | writer-critic | -5 |
| INV-8 | Every causal claim has a corresponding identification section | Paper | writer-critic | -10 |
| INV-9 | biblatex + biber, not natbib + bibtex | Paper | writer-critic, verifier | -3 |
| INV-10 | hyperref loaded second-to-last; cleveref loaded after it | Paper | writer-critic, verifier | -2 |
| INV-11 | Numbers in text match tables/figures exactly | Paper | writer-critic | -5 |
| INV-12 | No titles inside figures; titles go in LaTeX \caption{} | Paper | writer-critic | -3 |
| INV-13 | Scripts export bare tabular: no \begin{table}, \caption, or notes | Paper / Code | writer-critic, coder-critic | -3 |
| INV-14 | set.seed() called exactly once, at top, if stochastic | Code | coder-critic, verifier | -5 |
| INV-15 | All packages loaded at top, before any computation | Code | coder-critic, verifier | -3 |
| INV-16 | No absolute paths; use here() / pathlib.Path / joinpath(@__DIR__) | Code | coder-critic, verifier | -5 |
| INV-17 | No growing vectors in loops; pre-allocate or vectorize | Code | coder-critic | -3 |
| INV-18 | Output files go to the path specified in CLAUDE.md | Code | coder-critic | -3 |
| INV-19 | No prohibited functions (setwd(), rm(list=ls()), install.packages(), attach()) | Code | coder-critic, verifier | -5 |
| INV-20 | Talk notation matches paper exactly | Talk | storyteller-critic | -5 |
| INV-21 | Every claim on a slide is traceable to the paper | Talk | storyteller-critic | -5 |
| INV-22 | Every numerical claim has an entry in the claim-source map | Traceability | writer-critic | -5 |

Invariants by Category

Paper (INV-1 through INV-13)

Most paper invariants enforce standard economics formatting that journals expect. A few deserve explanation:

INV-4 (significance stars) adapts to the target journal. AEA journals prohibit significance stars entirely — report standard errors and confidence intervals instead. Working papers default to the * p < 0.10, ** p < 0.05, *** p < 0.01 convention. The journal profile determines which convention applies.

INV-7 (notation consistency) catches a common drafting problem: using \(\beta\) for the treatment effect in Section 3 but \(\delta\) in Section 6, or using \(i\) for individuals in one place and firms in another. The same symbol must mean the same thing everywhere.

INV-8 (causal claims need identification) prevents the most common referee complaint in empirical economics. If you write “X causes Y,” you need an identification section explaining how you separate causation from correlation. Descriptive papers should not use causal language at all.

INV-11 (numbers match) catches stale values — when you re-run the analysis and get 0.045 but the text still says 0.052 from a previous run. The writer-critic cross-references every number in the text against the tables and figures.
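A crude version of that cross-reference can be sketched in a few lines. This is only an illustration of the idea, not how the writer-critic actually works: the function name is hypothetical, numbers are compared as strings (so 0.05 and 0.050 count as different), and a real check would need to skip things like section numbers.

```python
import re

# Decimal numbers such as 0.052 or -1.30; integers are ignored in this sketch.
NUMBER = re.compile(r"(?<![\w.])-?\d+\.\d+")


def stale_numbers(text: str, table_values: set[str]) -> list[str]:
    """Return decimal numbers quoted in the text that appear in no table."""
    return [n for n in NUMBER.findall(text) if n not in table_values]


text = "The estimated effect is 0.052 (SE 0.011)."
table_values = {"0.045", "0.011"}  # values exported by the latest run
print(stale_numbers(text, table_values))  # ['0.052']
```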

INV-13 (bare tabular export) enforces the division of labor between code and paper. R/Python/Julia scripts produce the raw table content. The main.tex file wraps it with \begin{table}, \caption{}, and notes. This means you can change a caption without re-running code, and code changes automatically flow into the paper.
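One way to picture the code-side check is a scan of the exported file for wrapper commands that belong in main.tex. The sketch below is an assumption about how such a check might look; the function name is hypothetical, and it does not attempt to detect table notes.

```python
# Commands that belong in main.tex, not in a script's exported file.
WRAPPERS = (r"\begin{table}", r"\end{table}", r"\caption")


def wrapper_violations(tex: str) -> list[str]:
    """Return the wrapper commands found in an exported table file."""
    return [w for w in WRAPPERS if w in tex]


# A bare tabular export: only the table body, in booktabs style.
bare = r"""\begin{tabular}{lc}
\toprule
 & (1) \\
\midrule
Treatment & 0.045 \\
\bottomrule
\end{tabular}"""

print(wrapper_violations(bare))  # []
```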

Code (INV-14 through INV-19)

Code invariants enforce reproducibility and portability:

INV-14 (single seed) ensures stochastic results are reproducible. One set.seed() call at the top of the master script, not scattered throughout. Multiple seeds make debugging non-deterministic failures nearly impossible.
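A lint-style check for the "exactly once" part could look roughly like this. It is a sketch under simplifying assumptions: the function names are hypothetical, comments are stripped naively at the first `#`, and it does not verify that the call sits at the top of the script.

```python
import re

SEED_CALL = re.compile(r"\bset\.seed\s*\(")


def seed_lines(script: str) -> list[int]:
    """1-based line numbers of set.seed() calls in an R script, ignoring comments."""
    hits = []
    for lineno, line in enumerate(script.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive: also cuts '#' inside strings
        if SEED_CALL.search(code):
            hits.append(lineno)
    return hits


def single_seed(script: str) -> bool:
    """True when the script calls set.seed() exactly once."""
    return len(seed_lines(script)) == 1
```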

INV-16 (no absolute paths) is the single most common portability failure. Code that works on your machine but breaks on a collaborator’s because of /Users/yourname/... paths is not reproducible code. Use here::here() in R, pathlib.Path in Python, or joinpath(@__DIR__, ...) in Julia.
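For the Python case, the pattern might look like the sketch below. The helper name and the example file are hypothetical; the point is only that the hard-coded part of every path is relative, with the root resolved at run time.

```python
from pathlib import Path

# Anchor paths to this file's directory instead of a user-specific location.
# In a notebook or REPL, __file__ is undefined, so fall back to the working directory.
try:
    PROJECT_ROOT = Path(__file__).resolve().parent
except NameError:
    PROJECT_ROOT = Path.cwd()


def project_path(*parts: str) -> Path:
    """Build a path inside the project from relative components."""
    return PROJECT_ROOT.joinpath(*parts)


data_file = project_path("data", "raw", "survey.csv")  # hypothetical file
```

The resolved path is still absolute on each machine; what the invariant rules out is writing a machine-specific prefix such as /Users/yourname/ into the source.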

INV-19 (prohibited functions) bans functions that break reproducibility or are dangerous in production. setwd() changes global state and breaks when paths differ across machines. rm(list = ls()) destroys the workspace and breaks interactive debugging. install.packages() in scripts is a side effect that should not run automatically. attach() places a data frame's columns on the search path, so a bare variable name can silently resolve to the wrong object.
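A scan for the four prohibited calls could be sketched as follows. The patterns and function name are assumptions for illustration, not the project's actual lint hook, and comment stripping is deliberately naive.

```python
import re

# One pattern per function banned by INV-19.
PATTERNS = {
    "setwd()": re.compile(r"\bsetwd\s*\("),
    "rm(list=ls())": re.compile(r"\brm\s*\(\s*list\s*=\s*ls\s*\(\s*\)\s*\)"),
    "install.packages()": re.compile(r"\binstall\.packages\s*\("),
    "attach()": re.compile(r"\battach\s*\("),
}


def prohibited_calls(script: str) -> list[tuple[int, str]]:
    """Return (line number, function) pairs for prohibited calls in an R script."""
    findings = []
    for lineno, line in enumerate(script.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore R comments (naive split)
        for name, pattern in PATTERNS.items():
            if pattern.search(code):
                findings.append((lineno, name))
    return findings
```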

Talk (INV-20 and INV-21)

Talk invariants enforce fidelity between the presentation and the paper:

INV-20 (notation matches) prevents the common problem of simplifying notation for slides and introducing inconsistencies. If the paper uses \(\hat{\beta}_{ATT}\), the slides must use the same symbol — not \(\beta\) or \(\hat{\tau}\).

INV-21 (traceable claims) prevents “orphan results” that appear on slides but not in the paper. Every number, every claim, every figure on a slide must trace back to the manuscript.

Traceability (INV-22)

INV-22 (claim-source map) is the audit trail. Every numerical claim in the manuscript — “treatment effect of 4.5 percentage points” — must have an entry in quality_reports/claim_source_map_{project}.md that traces it to a specific script line and output file. This makes it possible for a referee (or a future you) to verify any number in the paper by following the chain from text to table to code.


Severity Gradient

Critics do not apply the same harshness at every stage. A missing citation in an early brainstorm is a gentle nudge; the same gap in a near-final manuscript is a serious deduction. The Orchestrator tells each critic which severity level to use.

Phase-Based Severity

| Phase | Critic stance | Rationale |
| --- | --- | --- |
| Discovery | Encouraging (low) | Early ideas need space to develop. Over-criticizing kills exploration. |
| Strategy | Constructive (medium) | Identification must be sound, but the critic should suggest alternatives, not just reject. |
| Execution | Strict (high) | Code and paper are near-final. Bugs found here are expensive to fix later. |
| Peer Review | Adversarial (maximum) | Simulates real referees. The goal is to find every weakness before a journal does. |
| Presentation | Professional (medium-high) | Talks should be polished, but scores are advisory: they do not block submission. |

The Orchestrator includes the severity level explicitly in the critic’s prompt:

You are reviewing at SEVERITY: HIGH (Execution phase).
Flag all issues. Do not suggest "consider" --- state what must change.
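Generating that first prompt line from the phase could be as simple as the sketch below. Only the HIGH label appears verbatim in the example prompt; the other upper-case labels are derived from the table's severity levels and, like the function name, are assumptions.

```python
# Phase -> severity label, following the Phase-Based Severity table.
SEVERITY_BY_PHASE = {
    "Discovery": "LOW",
    "Strategy": "MEDIUM",
    "Execution": "HIGH",
    "Peer Review": "MAXIMUM",
    "Presentation": "MEDIUM-HIGH",
}


def severity_header(phase: str) -> str:
    """Build the severity line placed at the top of a critic's prompt."""
    level = SEVERITY_BY_PHASE[phase]
    return f"You are reviewing at SEVERITY: {level} ({phase} phase)."


print(severity_header("Execution"))
# You are reviewing at SEVERITY: HIGH (Execution phase).
```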

Deduction Scaling

The same violation costs more in later phases. This table shows how four representative issues scale:

| Issue | Discovery | Strategy | Execution | Peer Review |
| --- | --- | --- | --- | --- |
| Missing citation | -2 | -5 | -10 | -15 |
| Notation inconsistency | -1 | -3 | -5 | -5 |
| Hedging language | | | -3 | -5 |
| Missing robustness check | | -5 | -15 | -20 |

The principle: early phases are about getting the direction right. Late phases are about getting the details right. A notation inconsistency in Discovery is a reminder; the same inconsistency in Execution means the writer is not reading their own paper carefully enough.
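The scaling can be expressed as a lookup keyed by issue and phase. The sketch below encodes only the two rows of the table that list a value for every phase; the function names are hypothetical.

```python
PHASES = ("Discovery", "Strategy", "Execution", "Peer Review")

# Deductions per phase, in the order of PHASES.
DEDUCTIONS = {
    "Missing citation": (-2, -5, -10, -15),
    "Notation inconsistency": (-1, -3, -5, -5),
}


def deduction(issue: str, phase: str) -> int:
    """Look up the deduction for an issue at a given phase."""
    return DEDUCTIONS[issue][PHASES.index(phase)]


def never_softens(issue: str) -> bool:
    """True if the penalty never shrinks as the project moves to later phases."""
    row = DEDUCTIONS[issue]
    return all(later <= earlier for earlier, later in zip(row, row[1:]))


print(deduction("Missing citation", "Execution"))  # -10
```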

Note: Why “adversarial” in Peer Review?

The simulated referees are intentionally harsh because real referees are harsh. A system that gives you 95/100 and then a journal rejects the paper has failed. The adversarial stance exists to catch problems before submission, not after. If the paper survives the system’s referees, it is better prepared for the real ones.


Quality Gates

Quality scores determine what actions are allowed. The system uses a weighted aggregate of individual critic scores, with hard thresholds at each gate.

The Four Gates

| Gate | Overall score | Per-component minimum | What it means |
| --- | --- | --- | --- |
| Commit | >= 80 | None | Work can be committed to the repository |
| PR | >= 90 | None | Work can be opened as a pull request |
| Submission | >= 95 | >= 80 per component | Work can be submitted to a journal |
| Blocked | < 80 | | Work cannot be committed; issues must be fixed first |
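Read as a decision rule, the gate table might be sketched like this. The function name and the shape of the inputs (an overall score plus a dict of component scores) are assumptions for illustration.

```python
def highest_gate(overall: float, components: dict[str, float]) -> str:
    """Return the most permissive gate that the scores clear."""
    # Submission needs both the aggregate and every component above its floor.
    if overall >= 95 and all(score >= 80 for score in components.values()):
        return "Submission"
    if overall >= 90:
        return "PR"
    if overall >= 80:
        return "Commit"
    return "Blocked"


print(highest_gate(96, {"code": 75, "paper": 99}))  # PR
```

In the usage line, an aggregate of 96 still stops at the PR gate because one component is below 80.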

How Scores Aggregate

Each critic produces a score from 0 to 100 by starting at 100 and deducting for issues found. These individual scores combine via a weighted average:

| Component | Weight | Source |
| --- | --- | --- |
| Literature coverage | 10% | librarian-critic |
| Data quality | 10% | explorer-critic |
| Identification validity | 25% | strategist-critic |
| Theory (when present) | 20% | theorist-critic |
| Code quality | 15% | coder-critic |
| Paper quality | 25% | Avg(domain-referee + methods-referee) |
| Manuscript polish | 10% | writer-critic |
| Replication readiness | 5% | Verifier (pass/fail, mapped to 0 or 100) |

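Theory weight applies only to papers with a formal theory section (econometric methods, theory + empirics, structural identification). For applied papers using off-the-shelf estimators, the theory row is excluded and the remaining seven weights, which sum to 100% as listed, apply directly. With the theory row included, the listed weights total 120%, so the weighted average has to divide by that total.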

When Components Are Missing

Not every project has every component. If a component has not been scored:

  • It is excluded from the weighted average
  • Remaining weights are renormalized proportionally
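  • Example: an applied paper (theory row excluded) with no literature review yet. The remaining weights of 10%, 25%, 15%, 25%, 10%, and 5% sum to 90%, so they become roughly 11%, 28%, 17%, 28%, 11%, and 6% (data quality, identification validity, code quality, paper quality, manuscript polish, replication readiness)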

This means you can run partial pipelines (e.g., just Strategy + Execution) and still get meaningful scores.
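The renormalization is plain proportional rescaling, which the following sketch reproduces. The dictionary keys are shorthand chosen for this example, and the function names are hypothetical; the weights are those from the table above.

```python
# Listed weights in percent (theory applies only when present).
WEIGHTS = {
    "literature": 10, "data": 10, "identification": 25, "theory": 20,
    "code": 15, "paper": 25, "polish": 10, "replication": 5,
}


def renormalized_weights(scored) -> dict[str, float]:
    """Rescale the weights of the scored components so they sum to 1."""
    total = sum(WEIGHTS[c] for c in scored)
    return {c: WEIGHTS[c] / total for c in scored}


def aggregate(scores: dict[str, float]) -> float:
    """Weighted average over whichever components have a score."""
    weights = renormalized_weights(scores)
    return sum(weights[c] * s for c, s in scores.items())


# Applied paper (no theory row), literature review not scored yet.
present = ["data", "identification", "code", "paper", "polish", "replication"]
shares = [round(100 * w) for w in renormalized_weights(present).values()]
print(shares)  # [11, 28, 17, 28, 11, 6]
```

The rounded shares add up to 101 rather than 100; that is a rounding artifact, since the unrounded weights sum to exactly 1.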

The Per-Component Minimum

For submission, the aggregate score must be >= 95 and no individual component can be below 80. A perfect literature review cannot compensate for broken identification. A beautiful manuscript cannot compensate for code that does not reproduce.

This rule prevents the “average-up” problem — where excellence in one area masks serious deficiencies in another.

Recovering from a Low Score

When a score falls below the gate threshold:

  1. The Orchestrator identifies which components are blocking
  2. It re-dispatches the relevant worker-critic pair to fix the flagged issues
  3. After fixes, the critic re-reviews (up to 3 rounds per pair)
  4. If 3 rounds fail, the system escalates — often back to the user for a design decision

The goal is convergence, not perfection on the first pass. Most artifacts reach 80+ by round 2.
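The recovery steps above can be sketched as a bounded loop. The callbacks stand in for the critic and the worker, and the names, the return shape, and the default threshold of 80 are assumptions for this example.

```python
from typing import Callable

MAX_ROUNDS = 3  # "up to 3 rounds per pair"


def converge(review: Callable[[], float], fix: Callable[[], None],
             threshold: float = 80) -> tuple[str, int]:
    """Alternate critic reviews and worker fixes; escalate after MAX_ROUNDS failures."""
    for round_no in range(1, MAX_ROUNDS + 1):
        if review() >= threshold:
            return "passed", round_no
        if round_no < MAX_ROUNDS:
            fix()  # worker addresses the flagged issues before the next review
    return "escalate", MAX_ROUNDS


scores = iter([70, 85])
print(converge(lambda: next(scores), lambda: None))  # ('passed', 2)
```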

Warning: The submission gate is intentionally high

A score of 95 is difficult to achieve. It requires all components to be polished, all robustness checks to pass, and the simulated referees to find no fatal issues. This is by design — the system should flag problems before a journal does, not after. If the gate feels too strict, the alternative is a rejected paper.


Working Paper Format

The working-paper-format.md rule file specifies the LaTeX standard that all papers must follow. The full reference preamble, title page conventions, and bibliography setup are documented there. Here is a summary of what the writer-critic enforces.

Required (Blocking Deductions)

| Violation | Deduction |
| --- | --- |
| Wrong document class or font size | -5 |
| Missing \doublespacing in body | -5 |
| Using natbib instead of biblatex | -3 |
| Missing fancyhdr page number setup | -2 |
| \textbf{} wrapping \title{} | -3 |
| \and between authors instead of \quad | -3 |
| Repeated affiliation text outside \thanks{} | -3 |
| Missing JEL codes or keywords | -5 |
| \hline instead of booktabs rules | -3 |
| Missing table notes | -5 |
| Missing figure notes | -5 |
| hyperref not loaded second-to-last | -2 |
| Missing cleveref after hyperref | -2 |
| Manual Figure~\ref{} instead of \cref{} | -1 per occurrence (max -5) |
| Using bibtex instead of biber | -3 |
| Missing microtype | -2 |
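Most rows are flat deductions, but the manual-reference row is per occurrence with a cap. A minimal sketch of that one rule follows; the regular expression and function name are assumptions, and it only looks for the literal Figure~\ref{ pattern named in the table.

```python
import re

# Hand-written figure references that should have been \cref{...}.
MANUAL_REF = re.compile(r"Figure~\\ref\{")


def manual_ref_deduction(tex: str) -> int:
    """-1 per manual Figure~\\ref{}, capped at -5."""
    count = len(MANUAL_REF.findall(tex))
    return -min(count, 5)


sample = r"See Figure~\ref{fig:main} and \cref{fig:robust}."
print(manual_ref_deduction(sample))  # -1
```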

Agent Enforcement Summary

Different agents check different subsets of the rules. This table shows which invariants each enforcer is responsible for and what happens when it finds a violation.

| Agent | Checks | Action on violation |
| --- | --- | --- |
| writer-critic | INV-1 through INV-13, INV-22 | Deduct per scoring rubric |
| coder-critic | INV-13 through INV-19 | Deduct per scoring rubric |
| storyteller-critic | INV-20, INV-21 | Deduct per scoring rubric |
| verifier | INV-9, INV-10, INV-14, INV-15, INV-16, INV-19 | FAIL (binary, no partial credit) |
| lint hook | INV-14, INV-15, INV-16, INV-19 | Advisory warning (non-blocking) |

Note the overlap: INV-13 is checked by both the writer-critic (paper side) and coder-critic (code side). INV-9 and INV-10 are checked by both the writer-critic (with deductions) and the verifier (with a hard FAIL); INV-14–16 and INV-19 are checked by both the coder-critic (with deductions) and the verifier (with a hard FAIL), and the lint hook adds an advisory warning for the same four. This redundancy is intentional: the critic catches issues during development; the verifier catches anything that slipped through at submission time.
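Encoding the enforcement table as sets makes the overlaps easy to list mechanically. This is a sketch for illustration; the variable and function names are hypothetical, and invariants are stored as plain integers.

```python
from collections import defaultdict

# Enforcer -> invariant numbers, transcribed from the Agent Enforcement Summary table.
CHECKS = {
    "writer-critic": set(range(1, 14)) | {22},
    "coder-critic": set(range(13, 20)),
    "storyteller-critic": {20, 21},
    "verifier": {9, 10, 14, 15, 16, 19},
    "lint hook": {14, 15, 16, 19},
}


def enforcers_by_invariant() -> dict[int, list[str]]:
    """Invert the table: for each invariant, which enforcers check it."""
    out = defaultdict(list)
    for agent, invariants in CHECKS.items():
        for n in sorted(invariants):
            out[n].append(agent)
    return dict(out)


# Invariants checked by more than one enforcer.
shared = {n: agents for n, agents in sorted(enforcers_by_invariant().items())
          if len(agents) > 1}
print(sorted(shared))  # [9, 10, 13, 14, 15, 16, 19]
```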


Lifecycle Validation

The Orchestrator runs two validation checks around every agent dispatch, defined in lifecycle.md.

PRE-dispatch: Before launching any agent, the Orchestrator reads the agent’s entry in permissions.md and verifies that all required input artifacts exist and contain the expected sections. If validation fails, the agent is not dispatched — instead, the system reports what is missing and suggests which skill to run first.

POST-completion: After an agent finishes, the Orchestrator verifies that the expected output artifacts were created, that required sections are present in those outputs, and that the paired critic has produced a scored report logged in the research journal. If validation fails, the agent is re-dispatched with the specific gaps noted.

The principle is fail fast: never launch an agent with missing inputs and hope it works, and never advance past an agent with missing outputs. Every handoff between agents is validated.
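A PRE-dispatch check in this spirit might look like the sketch below. The function name, the mapping of artifact paths to required section headings, and the substring test for "contains the section" are all assumptions; lifecycle.md defines the actual protocol.

```python
from pathlib import Path


def missing_inputs(required: dict[str, list[str]], root: Path) -> list[str]:
    """List problems that should stop a dispatch; an empty list means ready.

    `required` maps a relative artifact path to the section headings it must contain.
    """
    problems = []
    for rel_path, sections in required.items():
        path = root / rel_path
        if not path.is_file():
            problems.append(f"{rel_path}: missing")
            continue
        text = path.read_text(encoding="utf-8")
        for section in sections:
            if section not in text:
                problems.append(f"{rel_path}: no section {section!r}")
    return problems
```

Returning the full list of gaps, rather than stopping at the first one, matches the behavior described above: the system reports everything that is missing so the user knows what to run first.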