Architecture

How the system coordinates agents, enforces quality, and learns

System Overview

The system is configured through six layers, loaded in order of precedence.

| Layer | Location | Loaded | Purpose |
| --- | --- | --- | --- |
| CLAUDE.md | Project root | Every session | Project identity, folder structure, commands, output organization |
| Rules | .claude/rules/ | Every session | Quality gates, invariants, permissions, lifecycle, workflow |
| References | .claude/references/ | On demand | Domain profile, journal profiles, coding standards |
| Skills | .claude/skills/ | When invoked | Rich folders: routing logic, templates, gotchas, config. The knowledge layer. |
| Agents | .claude/agents/ | When dispatched | Lean identity files: role, voice, constraints. The execution layer. |
| Hooks | .claude/hooks/ | Event-driven | Pre-commit lint, pre-compact checklist |

CLAUDE.md and rules are always in context. Everything else loads on demand. For the full layer-by-layer breakdown, see the Customization Guide.


The Skill-Folder Architecture

Skills used to be thin dispatchers: a single SKILL.md file (~100 lines) that routed a command to an agent. All the real knowledge lived inside agent files: checklists, templates, scoring rubrics, and reference protocols, packed into monolithic prompts that ran 200–474 lines. This had three costs. Adding knowledge meant editing a massive agent file. Knowledge was locked inside agents and could not be reused across contexts. Loading an agent meant loading all of its templates even when only one was needed.

The architecture inverted. Skills are now rich folders. Agents are lean identity files.

Three-level progressive loading

The system loads context in three stages, each gated by need:

| Level | Trigger | Contents | Budget |
| --- | --- | --- | --- |
| 1: Metadata | Always (skill index) | Name, description, argument-hint | ~100 tokens |
| 2: SKILL.md body | User invokes the skill | Workflow steps, dispatch logic, modes | < 5k tokens |
| 3: Bundled files | Agent reads on demand | Templates, gotchas, checklists, references | Unlimited |

Level 1 costs almost nothing — every skill’s frontmatter is visible in the skill index, so the system knows what commands exist without loading their bodies. Level 2 activates only when the user types the slash command. Level 3 loads only the specific files the agent needs for the task at hand.
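The three levels can be pictured as a loader that reads only what each trigger requires. The sketch below is illustrative: `SkillLoader`, its method names, and the in-memory `files` dictionary are hypothetical stand-ins, not part of the system, and Level 1 is reduced to skill names for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class SkillLoader:
    """Hypothetical loader illustrating the three loading levels."""
    files: dict[str, str]                       # path -> content (stands in for disk)
    loaded: list[str] = field(default_factory=list)

    def index(self) -> list[str]:
        # Level 1: only skill names are exposed; no file body is read.
        return sorted({path.split("/")[0] for path in self.files})

    def invoke(self, skill: str) -> str:
        # Level 2: the SKILL.md body loads when the user invokes the skill.
        return self._read(f"{skill}/SKILL.md")

    def read_bundled(self, skill: str, relative_path: str) -> str:
        # Level 3: a bundled file loads only when the agent asks for it.
        return self._read(f"{skill}/{relative_path}")

    def _read(self, path: str) -> str:
        self.loaded.append(path)
        return self.files[path]


loader = SkillLoader(files={
    "write/SKILL.md": "workflow steps",
    "write/gotchas.md": "known failure points",
    "write/templates/section.md": "section template",
})
skills = loader.index()
body = loader.invoke("write")
gotchas = loader.read_bundled("write", "gotchas.md")
```

After these calls, `loader.loaded` lists only the two files that were actually requested; the template stays unread until an agent asks for it.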

Skill folder structure

Every skill follows the same layout:

skill-name/
├── SKILL.md          # Routing + dispatch logic (Level 2)
├── gotchas.md        # Known failure points, edge cases
├── templates/        # Reusable artifacts the agent loads
├── references/       # Domain-specific reference material
└── config/           # Machine-readable configuration

Not every skill uses every subfolder. The /write skill has templates/ (paragraph moves, section templates, cleanup patterns) and references/ (notation protocol). The /review skill has templates/ (scoring rubrics, referee report template) and config/ (disposition pool weights). The /strategize skill has templates/ (strategy memo structure, PAP templates, theory memo) and references/ (PAP interview flow). Each skill carries exactly the knowledge its agents need.

The wiring

When a skill dispatches an agent, it tells the agent which templates to read. The agent file says “read the templates provided by the invoking skill.” This separation means:

  • Skills own the WHAT — what to produce, what structure to follow, what pitfalls to avoid.
  • Agents own the HOW — voice, reasoning approach, quality standards.

The same agent can serve multiple skills. The coder-critic reviews output from both /analyze and /review --code, loading different checklists each time. The strategist produces both strategy memos and pre-analysis plans, guided by different templates from the same skill folder.

Gotchas as first-class content

Every skill has a gotchas.md — the highest-signal file in the folder. These are known failure modes, accumulated from real pipeline runs. The strategist’s gotchas warn that “staggered adoption does not automatically mean DiD is the right choice.” The writer’s gotchas note that “the cleanup pass can over-correct domain-specific hedging that’s actually appropriate.” The review skill’s gotchas document when cold-read critics produce false positives.

Gotchas are the fastest path to quality improvement. When a failure recurs, it gets a line in gotchas.md. The agent reads it before execution. The failure stops recurring.

What this means in practice

Adding a new template no longer requires editing a 400-line agent prompt. We create a file in templates/, reference it from SKILL.md, and the agent picks it up on next invocation. Knowledge accumulates in skill folders without bloating agent context. Agents stay lean enough to load fast and focused enough to execute well.


The Permission Registry

Every agent’s capabilities, dependencies, and routing are declared in a single file: .claude/rules/permissions.md. The Orchestrator reads this registry before dispatching any agent. No other file hardcodes agent relationships.

Reading an entry

Each entry declares seven fields: PHASE, PARALLEL_GROUP, REQUIRES, PRODUCES, CRITIC, ESCALATION_TARGET, and QUALITY_WEIGHT.

Here is the strategist entry, annotated:

## strategist
- PHASE: Strategy
- PARALLEL_GROUP: strategy
- REQUIRES: quality_reports/literature/{project}/
             OR quality_reports/data-assessment/{project}/
- PRODUCES: quality_reports/strategy/{project}/
    - Required files: strategy_memo.md
    - Required sections: Estimand, Specification, Assumptions,
                         Robustness Plan, Threats
- CRITIC: strategist-critic
- ESCALATION_TARGET: User --- fundamental design question
- QUALITY_WEIGHT: 25% (identification validity)

The REQUIRES field uses OR logic: the strategist can activate after literature review alone, data assessment alone, or both. The PRODUCES field specifies not just the output file but also the sections that must appear in it. POST-completion validation (see Lifecycle Validation) checks that these sections exist before the pipeline advances.
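As a rough illustration of the OR logic, the check below treats REQUIRES as a list of alternative directories and passes when any one of them exists and is non-empty. The function name `requires_satisfied` and the project name `demo` (substituted for `{project}`) are assumptions made for this sketch.

```python
from pathlib import Path
import tempfile


def requires_satisfied(alternatives: list[str], root: Path) -> bool:
    """Return True if at least one alternative directory exists and is non-empty."""
    for relative in alternatives:
        directory = root / relative
        if directory.is_dir() and any(directory.iterdir()):
            return True
    return False


# Hypothetical project name "demo" substituted for {project}.
strategist_requires = [
    "quality_reports/literature/demo",
    "quality_reports/data-assessment/demo",
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = requires_satisfied(strategist_requires, root)   # nothing exists yet
    literature = root / "quality_reports/literature/demo"
    literature.mkdir(parents=True)
    (literature / "review.md").write_text("notes")
    after = requires_satisfied(strategist_requires, root)    # one alternative is met
```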

Adding an agent

Adding a new agent touches two files and nothing else: create the agent file in .claude/agents/ and add its entry to permissions.md. The Orchestrator discovers agents by reading the registry, so no other file needs modification.


The Orchestrator Loop

After a plan is approved, the Orchestrator takes over autonomously. It runs a 5-step loop until quality gates are met or limits are reached.

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#f0eee6', 'primaryTextColor': '#141413', 'primaryBorderColor': '#d1cfc5', 'lineColor': '#d97757', 'secondaryColor': '#faf9f5', 'tertiaryColor': '#e6e3da', 'edgeLabelBackground': '#faf9f5', 'clusterBkg': '#f0eee6', 'clusterBorder': '#d1cfc5'}}}%%
flowchart TD
    START([Plan approved]):::start --> ID
    ID[1. IDENTIFY\nCheck dependency graph]:::step --> DISP[2. DISPATCH\nLaunch worker-critic pairs]:::step
    DISP --> REV[3. REVIEW\nCritic scores artifact]:::step
    REV -->|score >= 80| VER[4. VERIFY\nCompile, render, run]:::step
    REV -->|score < 80| FIX[Worker fixes]:::worker
    FIX --> REV
    VER --> SC[5. SCORE\nWeighted aggregate]:::step
    SC -->|>= threshold| DONE([Present summary]):::approved
    SC -->|< threshold| ID
    SC -.->|max 5 rounds| DONE

    classDef start fill:#f0eee6,stroke:#d97757,color:#141413,stroke-width:2px
    classDef step fill:#f0eee6,stroke:#87867f,color:#141413,stroke-width:2px
    classDef worker fill:#faf9f5,stroke:#d97757,color:#141413,stroke-width:1px
    classDef approved fill:#f0eee6,stroke:#4d7c5a,color:#141413,stroke-width:2px

Step 1: IDENTIFY

The Orchestrator reads permissions.md and checks which agents have their REQUIRES satisfied. Agents activate when dependencies are met, not by phase number. If you already have data and a draft paper, you can enter at Strategy or Peer Review — the Orchestrator checks REQUIRES, not sequence.

Step 2: DISPATCH

Worker-critic pairs launch for all agents whose REQUIRES are satisfied. Agents in the same PARALLEL_GROUP run concurrently; for example, the librarian and explorer both belong to the discovery group and run simultaneously. Before dispatch, the Orchestrator runs PRE-dispatch validation (see Lifecycle Validation).
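One way to picture the grouping is a function that batches ready agents by their PARALLEL_GROUP. This is a sketch only: `group_for_dispatch` is a hypothetical name, and the input dictionary stands in for whatever the Orchestrator reads from the registry.

```python
from collections import defaultdict


def group_for_dispatch(ready_agents: dict[str, str]) -> dict[str, list[str]]:
    """Map each PARALLEL_GROUP to the ready agents that belong to it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for agent, parallel_group in ready_agents.items():
        groups[parallel_group].append(agent)
    return {name: sorted(members) for name, members in groups.items()}


# Agent -> PARALLEL_GROUP, as it might be read from the registry (illustrative input).
ready = {"librarian": "discovery", "explorer": "discovery"}
batches = group_for_dispatch(ready)
```

Every agent inside one batch could then be launched concurrently.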

Step 3: REVIEW

The paired critic evaluates the worker’s artifact against its rubric. If the score is below 80, the worker revises and the critic re-reviews. This inner loop runs up to 3 rounds per pair (see Worker-Critic Protocol).

Step 4: VERIFY

Compile LaTeX, run scripts, check that output files exist at expected paths. If verification fails, the worker fixes and re-verifies (max 2 attempts).

Step 5: SCORE

Compute the weighted aggregate across all scored components. If the score meets the gate threshold, present the summary to the user. If not, loop back to Step 1 to identify the blocking components.

Limits

| Limit | Value | What happens |
| --- | --- | --- |
| Worker-critic rounds | 3 | Escalate per ESCALATION_TARGET |
| Overall loop rounds | 5 | Present with remaining issues |
| Verification retries | 2 | Report failure to user |

The system never loops indefinitely.
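The worker-critic limit can be sketched as a bounded loop that either approves or escalates. The names `run_pair` and `review` are hypothetical; `review` stands in for one worker revision followed by one critic evaluation.

```python
from typing import Callable

PASS_SCORE = 80
MAX_PAIR_ROUNDS = 3


def run_pair(review: Callable[[int], int]) -> tuple[str, int]:
    """Run the worker-critic loop; `review(round_number)` returns that round's critic score."""
    for round_number in range(1, MAX_PAIR_ROUNDS + 1):
        if review(round_number) >= PASS_SCORE:
            return ("approved", round_number)
    # All rounds used without reaching the pass score: hand off to the escalation target.
    return ("escalate", MAX_PAIR_ROUNDS)


scores_by_round = {1: 65, 2: 74, 3: 83}
improving = run_pair(lambda round_number: scores_by_round[round_number])
stuck = run_pair(lambda round_number: 70)
```

Because the loop is a `for` over a fixed range, termination does not depend on the scores ever improving.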

Modes

Pipeline mode (/new-project): Full orchestration with dependency graph, parallel dispatch, and quality gates.

Standalone mode (/review, /write, /analyze, etc.): Skip dependency checks, dispatch the requested agent(s) directly, return results.

Simplified mode (R scripts, explorations): Plan, implement, run, verify quality >= 80. No multi-agent reviews.


Lifecycle Validation

The Orchestrator runs validation checks at two points: before dispatching an agent and after an agent completes. This implements a fail-fast principle — missing inputs are caught before work begins, missing outputs are caught before advancement.

PRE-dispatch checks

Before dispatching any agent, the Orchestrator reads its permissions.md entry and verifies that REQUIRES artifacts exist (file paths globbed, score gates checked against the research journal, directories confirmed non-empty) and that required sections are present in structured artifacts.

If PRE validation fails, the agent is not dispatched.

POST-completion checks

After an agent completes, the Orchestrator verifies that PRODUCES artifacts exist, required sections are present, and the critic score is recorded in the research journal.

If POST validation fails, the pipeline does not advance. The Orchestrator re-dispatches the agent with the specific gaps noted.

What failure looks like

Cannot dispatch [coder]: missing quality_reports/strategy/{project}/strategy_memo.md
  → Run /strategize first to produce the strategy memo.

No agent launches with missing inputs. No agent advances with missing outputs. The Orchestrator includes the validation failure in its report to the user.
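A POST-completion check for required sections might look like the sketch below. It assumes that a "section" corresponds to a markdown heading in the artifact, which the text does not state explicitly; `missing_sections` is a hypothetical helper name.

```python
import re


def missing_sections(markdown_text: str, required: list[str]) -> list[str]:
    """Return required section titles that have no matching markdown heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}\s+(.+?)\s*$", markdown_text, re.MULTILINE)
    }
    return [title for title in required if title.lower() not in headings]


# Required sections taken from the strategist entry in the registry.
required_sections = ["Estimand", "Specification", "Assumptions", "Robustness Plan", "Threats"]
memo = "# Strategy Memo\n## Estimand\n...\n## Specification\n...\n## Assumptions\n...\n"
gaps = missing_sections(memo, required_sections)
```

A non-empty `gaps` list is the kind of specific gap the Orchestrator could pass back when it re-dispatches the agent.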


Worker-Critic Protocol

Every agent that creates an artifact has a paired critic that evaluates it. For the full list of pairs and what each reviews, see Agents.

The 3-round loop

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#f0eee6', 'primaryTextColor': '#141413', 'primaryBorderColor': '#d1cfc5', 'lineColor': '#d97757', 'secondaryColor': '#faf9f5', 'tertiaryColor': '#e6e3da', 'edgeLabelBackground': '#faf9f5', 'clusterBkg': '#f0eee6', 'clusterBorder': '#d1cfc5'}}}%%
flowchart TD
    W[Worker creates artifact]:::worker --> C[Critic reviews cold-read]:::critic
    C -->|score >= 80| A([APPROVED]):::approved
    C -->|score < 80| F[Worker fixes issues]:::worker
    F --> C2[Critic re-reviews cold-read]:::critic
    C2 -->|score >= 80| A
    C2 -->|score < 80, round < 3| F
    C2 -->|score < 80, round = 3| ESC([ESCALATE]):::escalate

    classDef worker fill:#f0eee6,stroke:#d97757,color:#141413,stroke-width:2px
    classDef critic fill:#f0eee6,stroke:#788c5d,color:#141413,stroke-width:2px
    classDef approved fill:#f0eee6,stroke:#4d7c5a,color:#141413,stroke-width:2px
    classDef escalate fill:#f0eee6,stroke:#b33d3d,color:#141413,stroke-width:2px

Cold-read enforcement

On every round, the critic receives only: the artifact, its scoring rubric, the severity level, and the content invariants from Rules & Invariants. It does not receive what round this is, what the worker struggled with, prior critic reports, or the research journal. This prevents the critic from softening its evaluation because it knows the worker has already tried twice.

Round-aware feedback (what to fix and how) flows from the Orchestrator to the worker, never through the critics.
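Cold-read enforcement amounts to an allow-list on what reaches the critic. The sketch below shows the idea with a plain dictionary filter; the key names and `build_critic_context` are hypothetical and chosen only to mirror the inputs listed above.

```python
ALLOWED_CRITIC_INPUTS = {"artifact", "rubric", "severity", "content_invariants"}


def build_critic_context(pipeline_context: dict[str, object]) -> dict[str, object]:
    """Keep only the inputs a cold-read critic is allowed to see."""
    return {
        key: value
        for key, value in pipeline_context.items()
        if key in ALLOWED_CRITIC_INPUTS
    }


full_context = {
    "artifact": "strategy_memo.md",
    "rubric": "strategist rubric",
    "severity": "Medium",
    "content_invariants": ["..."],
    "round": 2,                          # withheld from the critic
    "prior_critic_reports": ["..."],     # withheld from the critic
    "research_journal": "...",           # withheld from the critic
}
critic_context = build_critic_context(full_context)
```

An allow-list is used rather than a deny-list so that any new field added to the pipeline context is withheld by default.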

Separation of powers

Critics never create artifacts. Creators never self-score. If a critic invocation produces a file in scripts/, paper/, or paper/talks/, the Orchestrator flags it as a violation. If a creator reports its own score, the Orchestrator discards it and dispatches the critic.

A critic that fixes its own findings has an incentive to find only fixable issues. Separation keeps criticism honest.

Escalation routing

When a pair hits 3 rounds without converging, the Orchestrator reads the agent’s ESCALATION_TARGET from permissions.md. Some examples from the registry:

| Pair | Escalation Target | Rationale |
| --- | --- | --- |
| coder + coder-critic | strategist-critic | Strategy memo may not be implementable |
| writer + writer-critic | Orchestrator | Needs structural rewrite, not polish |
| strategist + strategist-critic | User | Fundamental design question |
| theorist + theorist-critic | User | Proof-level disagreement |

Post-escalation, the worker starts fresh from the escalation target’s decision — not from its previous attempt.


Dual-Critic Dispatch

For artifacts that gate major phase transitions, the Orchestrator dispatches two critics with different evaluation lenses. This catches blind spots that a single-dimension review misses.

Gate artifacts

| Artifact | Primary Critic | Secondary Lens |
| --- | --- | --- |
| Strategy memo | strategist-critic (identification) | coder-critic (implementability) |
| Main results (code) | coder-critic (code quality) | strategist-critic (strategy alignment) |
| Final manuscript | writer-critic (prose quality) | strategist-critic (claims-strategy match) |

Synthesis rules

  • Both >= 80: Pass. Artifact advances.
  • One < 80: Worker fixes issues from the failing critic. Both critics re-evaluate.
  • Both < 80: Escalate per the pair’s ESCALATION_TARGET.
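The three synthesis rules reduce to a small decision function. This is a minimal sketch; the function name `synthesize` and the returned labels are illustrative choices, not identifiers from the system.

```python
def synthesize(primary_score: int, secondary_score: int, threshold: int = 80) -> str:
    """Apply the three synthesis rules to a pair of independent critic scores."""
    passed = [primary_score >= threshold, secondary_score >= threshold]
    if all(passed):
        return "pass"                        # both critics approve
    if any(passed):
        return "fix_and_re_evaluate_both"    # one critic fails: fix, then both re-review
    return "escalate"                        # both critics fail


outcomes = [synthesize(85, 91), synthesize(85, 72), synthesize(60, 72)]
```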

Independence

Neither critic sees the other’s report. Each evaluates the artifact independently, and only the Orchestrator combines the two scores. This is the same cold-read principle applied across critics, not just across rounds.

Scope

Dual-critic dispatch activates only for gate artifacts in the full pipeline (/new-project). Standalone skill invocations (/review, /write, etc.) use single critics.


Pipeline State and Session Recovery

The state file

The Orchestrator maintains structured pipeline state in quality_reports/pipeline_state.json. The schema tracks:

{
  "project": "project-name",
  "phase": "Execution",
  "last_updated": "2026-05-09T14:30:00Z",
  "agents_completed": [
    { "agent": "strategist", "critic": "strategist-critic",
      "score": 88, "rounds": 1, "artifact": "strategy_memo.md" }
  ],
  "agents_in_progress": [
    { "agent": "coder", "critic": "coder-critic",
      "current_round": 2, "max_rounds": 3,
      "last_score": 72, "issues_remaining": ["missing robustness"] }
  ],
  "agents_pending": ["writer"],
  "overall_score": null,
  "blocked_by": null
}

Write triggers

The Orchestrator updates the state file at four points:

  1. After agent completion — move from agents_in_progress to agents_completed
  2. After critic score — update the score and round count
  3. After phase transition — update the phase field
  4. After escalation — set blocked_by
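The first write trigger can be sketched as a pure function over the state dictionary. The field names follow the schema above; `complete_agent` and the artifact name `main_results.R` are hypothetical, and writing the result to disk is left out.

```python
import json
from datetime import datetime, timezone


def complete_agent(state: dict, agent: str, score: int, artifact: str) -> dict:
    """Move `agent` from agents_in_progress to agents_completed and refresh the timestamp."""
    remaining, finished = [], None
    for entry in state["agents_in_progress"]:
        if entry["agent"] == agent:
            finished = entry
        else:
            remaining.append(entry)
    if finished is None:
        raise ValueError(f"{agent} is not in progress")
    state["agents_in_progress"] = remaining
    state["agents_completed"].append({
        "agent": agent, "critic": finished["critic"], "score": score,
        "rounds": finished["current_round"], "artifact": artifact,
    })
    state["last_updated"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return state


state = {
    "agents_completed": [],
    "agents_in_progress": [{"agent": "coder", "critic": "coder-critic",
                            "current_round": 2, "max_rounds": 3, "last_score": 72,
                            "issues_remaining": ["missing robustness"]}],
    "agents_pending": ["writer"],
}
state = complete_agent(state, "coder", score=86, artifact="main_results.R")  # hypothetical artifact
serialized = json.dumps(state, indent=2)   # what would be written to pipeline_state.json
```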

Read triggers

On session start (a new session or after context compression), the Orchestrator reads pipeline_state.json as its first action. The structured state is faster and more reliable than reconstructing progress from prose logs.

Session recovery order

| Step | Action | Source |
| --- | --- | --- |
| 0 | Read pipeline state | quality_reports/pipeline_state.json |
| 1 | Read checkpoint artifacts | SESSION_REPORT.md, quality_reports/research_journal.md |
| 2 | Read project config + plan | CLAUDE.md, most recent file in quality_reports/plans/ |
| 3 | Check recent changes | git log --oneline -10, git diff |

Execution traces

After completing a multi-agent skill (/new-project, /analyze, /write full), the Orchestrator generates an execution trace from the pipeline state and saves it to quality_reports/traces/. The trace includes a mermaid graph of the agent invocation sequence, a table of rounds and scores per agent, any escalation events, and a pipeline summary with totals.

Relationship to the research journal

pipeline_state.json is structured, machine-readable, and overwritten. research_journal.md is narrative, append-only, and human-readable. Both are maintained — the state file is for fast recovery, the journal is for understanding what happened and why.


Scoring Protocol

Weighted aggregation

Each agent’s quality weight is declared in its permissions.md entry. The Orchestrator reads the registry and computes:

\[\text{Overall} = \sum_i w_i \times \text{score}_i\]

where \(w_i\) is the agent’s QUALITY_WEIGHT, normalized so that the weights of the scored components sum to 1, and \(\text{score}_i\) is the critic’s final score (0–100). Current weights:

| Component | Weight | Source |
| --- | --- | --- |
| Literature coverage | 10% | librarian-critic |
| Data quality | 10% | explorer-critic |
| Identification validity | 25% | strategist-critic |
| Theory (when present) | 20% | theorist-critic |
| Code quality | 15% | coder-critic |
| Paper quality | 25% | Avg(domain-referee, methods-referee) |
| Manuscript polish | 10% | writer-critic |
| Replication readiness | 5% | verifier (0 or 100) |

Missing components

If a component has not been scored, it is excluded and remaining weights renormalize. The theory weight applies only when the theorist agent was dispatched (papers with formal theory sections). For applied papers using off-the-shelf estimators, theory is excluded automatically.
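The listed weights sum to 100% without theory and 120% with it, so renormalization matters in both cases. The sketch below assumes that "renormalize" means dividing by the total weight of the components that were actually scored; the short component keys and `overall_score` are hypothetical names.

```python
# Weights in percent, copied from the table; keys are shortened labels.
WEIGHTS = {
    "literature": 10, "data": 10, "identification": 25, "theory": 20,
    "code": 15, "paper": 25, "polish": 10, "replication": 5,
}


def overall_score(scores: dict[str, float]) -> float:
    """Weighted mean over scored components; unscored components drop out."""
    total_weight = sum(WEIGHTS[name] for name in scores)
    if total_weight == 0:
        raise ValueError("no scored components")
    return sum(WEIGHTS[name] * value for name, value in scores.items()) / total_weight


# Applied paper: the theorist was never dispatched, so "theory" is absent.
applied_paper = {"literature": 90, "data": 80, "identification": 85,
                 "code": 88, "paper": 82, "polish": 90, "replication": 100}
result = overall_score(applied_paper)

# Early in a pipeline, only two components may have scores.
partial = overall_score({"identification": 80, "code": 90})
```

Under this assumption the aggregate always stays on the 0–100 scale, whichever components are present.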

Gate thresholds

| Gate | Overall Score | Per-Component Min | Consequence |
| --- | --- | --- | --- |
| Commit | >= 80 | None | Allowed |
| PR | >= 90 | None | Allowed |
| Submission | >= 95 | >= 80 | Allowed |
| Below 80 | < 80 | | Blocked |
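Read as code, the threshold table becomes a cascade from the strictest gate down. This is a sketch under the assumption that each gate implies the ones below it; `highest_gate` and its return labels are hypothetical.

```python
def highest_gate(overall: float, component_scores: list[float]) -> str:
    """Return the highest gate the scores clear, following the threshold table."""
    if overall >= 95 and all(score >= 80 for score in component_scores):
        return "submission"
    if overall >= 90:
        return "pr"
    if overall >= 80:
        return "commit"
    return "blocked"


gates = [
    highest_gate(96, [90, 85, 97]),
    highest_gate(96, [90, 75, 97]),   # one component under 80 rules out submission
    highest_gate(84, [84]),
    highest_gate(79, [79]),
]
```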

Severity gradient

Critics calibrate harshness by phase. The Orchestrator includes the severity level in the critic’s prompt. For the full deduction scaling table, see Rules & Invariants.

| Phase | Severity | Stance |
| --- | --- | --- |
| Discovery | Low | Encouraging: early ideas need space |
| Strategy | Medium | Constructive: sound identification, but suggest alternatives |
| Execution | High | Strict: bugs are costly at this stage |
| Peer Review | Maximum | Adversarial: simulates real referees |
| Presentation | Medium-high | Professional: polished but advisory |

Plan-First Protocol

For any non-trivial task, the system enters plan mode before implementation.

The protocol

  1. Enter plan mode
  2. Check memory — read [LEARN] entries relevant to the task
  3. Requirements spec (for complex/ambiguous tasks) — clarify ambiguities, mark requirements as MUST / SHOULD / MAY, declare clarity status as CLEAR / ASSUMED / BLOCKED
  4. Draft the plan — what changes, which files, in what order
  5. Save to disk: quality_reports/plans/YYYY-MM-DD_description.md
  6. Present to user — wait for approval
  7. Exit plan mode — only after approval
  8. Save session log — capture goal and context while fresh
  9. Implement — Orchestrator takes over via the dependency-driven loop

Plans survive context compression because they are saved to disk. The requirements spec step is optional — skip it for clear, specific tasks; use it for ambiguous or multi-file work.

Simplified mode

For standalone R scripts, simulations, and explorations: plan, implement, run, verify quality >= 80. No multi-agent reviews.

“Just do it” mode

When the user says “just do it” or “handle it”: skip the final approval pause, auto-commit if score >= 80, but still run the full verify-review-fix loop.


Learning Loop

After completing a multi-agent skill, the Orchestrator reviews the execution trace for three pattern types.

Pattern detection

| Pattern | Signal | Example |
| --- | --- | --- |
| HIGH-PERF | Agent-critic pair scored >= 90 on first pass | Strategist produced a clean memo that passed on round 1 |
| FRICTION | Pair hit 3 strikes | Coder-critic kept failing on numerical discipline; prompt may need refinement |
| ESCALATION | Question escalated to user | Could the system have resolved it with better context or rules? |
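The three signals could be detected from per-pair trace records roughly as follows. The record fields (`rounds`, `score`, `escalated_to`) and `detect_patterns` are assumptions for this sketch, and "3 strikes" is interpreted here as three rounds ending below 80.

```python
def detect_patterns(run: dict) -> list[str]:
    """Label one agent-critic run with the pattern types it exhibits."""
    patterns = []
    if run["rounds"] == 1 and run["score"] >= 90:
        patterns.append("HIGH-PERF")      # passed with >= 90 on the first round
    if run["rounds"] >= 3 and run["score"] < 80:
        patterns.append("FRICTION")       # used all three rounds without converging
    if run.get("escalated_to") == "User":
        patterns.append("ESCALATION")     # the question reached the user
    return patterns


runs = [
    {"agent": "strategist", "rounds": 1, "score": 92},
    {"agent": "coder", "rounds": 3, "score": 74, "escalated_to": "strategist-critic"},
    {"agent": "theorist", "rounds": 3, "score": 70, "escalated_to": "User"},
]
labels = [detect_patterns(run) for run in runs]
```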

How it works

The Orchestrator surfaces findings as “Suggested Learnings” at the end of the pipeline summary:

### Suggested Learnings
- [HIGH-PERF] Strategist memo structure matched coder needs ---
  consider templating the robustness section
- [FRICTION] Coder failed numerical discipline 3 rounds ---
  add float-comparison examples to coding standards
- [ESCALATION] User resolved data-source ambiguity ---
  add data source preferences to domain profile

The Orchestrator never auto-appends to memory. It presents suggestions. The user approves or rejects. Approved learnings are saved via the auto-memory system.

Learning promotion

When a pattern appears across 3 or more projects, it is a candidate for promotion to a template or rule file — moving from project-specific memory to the system’s permanent configuration. This is how the architecture improves over time without accumulating noise.