%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#f0eee6', 'primaryTextColor': '#141413', 'primaryBorderColor': '#d1cfc5', 'lineColor': '#d97757', 'secondaryColor': '#faf9f5', 'tertiaryColor': '#e6e3da', 'edgeLabelBackground': '#faf9f5', 'clusterBkg': '#f0eee6', 'clusterBorder': '#d1cfc5'}}}%%
flowchart TD
START([Plan approved]):::start --> ID
ID[1. IDENTIFY\nCheck dependency graph]:::step --> DISP[2. DISPATCH\nLaunch worker-critic pairs]:::step
DISP --> REV[3. REVIEW\nCritic scores artifact]:::step
REV -->|score >= 80| VER[4. VERIFY\nCompile, render, run]:::step
REV -->|score < 80| FIX[Worker fixes]:::worker
FIX --> REV
VER --> SC[5. SCORE\nWeighted aggregate]:::step
SC -->|>= threshold| DONE([Present summary]):::approved
SC -->|< threshold| ID
SC -.->|max 5 rounds| DONE
classDef start fill:#f0eee6,stroke:#d97757,color:#141413,stroke-width:2px
classDef step fill:#f0eee6,stroke:#87867f,color:#141413,stroke-width:2px
classDef worker fill:#faf9f5,stroke:#d97757,color:#141413,stroke-width:1px
classDef approved fill:#f0eee6,stroke:#4d7c5a,color:#141413,stroke-width:2px
Architecture
How the system coordinates agents, enforces quality, and learns
System Overview
The system is configured through six layers, loaded in order of precedence.
| Layer | Location | Loaded | Purpose |
|---|---|---|---|
| CLAUDE.md | Project root | Every session | Project identity, folder structure, commands, output organization |
| Rules | .claude/rules/ | Every session | Quality gates, invariants, permissions, lifecycle, workflow |
| References | .claude/references/ | On demand | Domain profile, journal profiles, coding standards |
| Skills | .claude/skills/ | When invoked | Rich folders: routing logic, templates, gotchas, config. The knowledge layer. |
| Agents | .claude/agents/ | When dispatched | Lean identity files: role, voice, constraints. The execution layer. |
| Hooks | .claude/hooks/ | Event-driven | Pre-commit lint, pre-compact checklist |
CLAUDE.md and rules are always in context. Everything else loads on demand. For the full layer-by-layer breakdown, see the Customization Guide.
The Skill-Folder Architecture
Skills used to be thin dispatchers: a single SKILL.md file (~100 lines) that routed a command to an agent. All the real knowledge lived inside agent files: checklists, templates, scoring rubrics, and reference protocols, packed into monolithic prompts that ran 200–474 lines. Adding knowledge meant editing a massive agent file. Knowledge was locked inside agents and could not be reused across contexts. Loading an agent meant loading all its templates even when the task needed only one.
The architecture inverted. Skills are now rich folders. Agents are lean identity files.
Three-level progressive loading
The system loads context in three stages, each gated by need:
| Level | Trigger | Contents | Budget |
|---|---|---|---|
| 1 — Metadata | Always (skill index) | Name, description, argument-hint | ~100 tokens |
| 2 — SKILL.md body | User invokes the skill | Workflow steps, dispatch logic, modes | < 5k tokens |
| 3 — Bundled files | Agent reads on demand | Templates, gotchas, checklists, references | Unlimited |
Level 1 costs almost nothing — every skill’s frontmatter is visible in the skill index, so the system knows what commands exist without loading their bodies. Level 2 activates only when the user types the slash command. Level 3 loads only the specific files the agent needs for the task at hand.
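As a rough illustration of the Level 1 / Level 2 split, the sketch below separates a skill file's frontmatter from its body so that only the metadata enters the index. It assumes the frontmatter is a block of simple `key: value` lines between two `---` delimiters; the function names `split_frontmatter` and `build_skill_index` are hypothetical and not part of the system.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into (metadata, body).

    Assumption: frontmatter is a block of `key: value` lines between two
    `---` delimiter lines at the very top of the file.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        # No closing delimiter: treat the whole file as body.
        return {}, text
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


def build_skill_index(skills: dict[str, str]) -> dict[str, dict[str, str]]:
    """Level 1: keep only the metadata of every skill; bodies are discarded."""
    return {name: split_frontmatter(text)[0] for name, text in skills.items()}
```

The body returned by `split_frontmatter` corresponds to Level 2 and would be read only once the skill is invoked.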
Skill folder structure
Every skill follows the same layout:
skill-name/
├── SKILL.md # Routing + dispatch logic (Level 2)
├── gotchas.md # Known failure points, edge cases
├── templates/ # Reusable artifacts the agent loads
├── references/ # Domain-specific reference material
└── config/ # Machine-readable configuration
Not every skill uses every subfolder. The /write skill has templates/ (paragraph moves, section templates, cleanup patterns) and references/ (notation protocol). The /review skill has templates/ (scoring rubrics, referee report template) and config/ (disposition pool weights). The /strategize skill has templates/ (strategy memo structure, PAP templates, theory memo) and references/ (PAP interview flow). Each skill carries exactly the knowledge its agents need.
The wiring
When a skill dispatches an agent, it tells the agent which templates to read. The agent file says “read the templates provided by the invoking skill.” This separation means:
- Skills own the WHAT — what to produce, what structure to follow, what pitfalls to avoid.
- Agents own the HOW — voice, reasoning approach, quality standards.
The same agent can serve multiple skills. The coder-critic reviews output from both /analyze and /review --code, loading different checklists each time. The strategist produces both strategy memos and pre-analysis plans, guided by different templates from the same skill folder.
Gotchas as first-class content
Every skill has a gotchas.md — the highest-signal file in the folder. These are known failure modes, accumulated from real pipeline runs. The strategist’s gotchas warn that “staggered adoption does not automatically mean DiD is the right choice.” The writer’s gotchas note that “the cleanup pass can over-correct domain-specific hedging that’s actually appropriate.” The review skill’s gotchas document when cold-read critics produce false positives.
Gotchas are the fastest path to quality improvement. When a failure recurs, it gets a line in gotchas.md. The agent reads it before execution. The failure stops recurring.
What this means in practice
Adding a new template no longer requires editing a 400-line agent prompt. We create a file in templates/, reference it from SKILL.md, and the agent picks it up on next invocation. Knowledge accumulates in skill folders without bloating agent context. Agents stay lean enough to load fast and focused enough to execute well.
The Permission Registry
Every agent’s capabilities, dependencies, and routing are declared in a single file: .claude/rules/permissions.md. The Orchestrator reads this registry before dispatching any agent. No other file hardcodes agent relationships.
Reading an entry
Each entry declares seven fields: PHASE, PARALLEL_GROUP, REQUIRES, PRODUCES, CRITIC, ESCALATION_TARGET, and QUALITY_WEIGHT.
Here is the strategist entry, annotated:
## strategist
- PHASE: Strategy
- PARALLEL_GROUP: strategy
- REQUIRES: quality_reports/literature/{project}/
OR quality_reports/data-assessment/{project}/
- PRODUCES: quality_reports/strategy/{project}/
- Required files: strategy_memo.md
- Required sections: Estimand, Specification, Assumptions,
Robustness Plan, Threats
- CRITIC: strategist-critic
- ESCALATION_TARGET: User --- fundamental design question
- QUALITY_WEIGHT: 25% (identification validity)
The REQUIRES field uses OR logic: the strategist can activate after literature review alone, data assessment alone, or both. The PRODUCES field specifies not just the output file but also the sections that must appear in it. POST-completion validation (see Lifecycle Validation) checks that these sections exist before advancing.
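A minimal sketch of the OR check, assuming each REQUIRES alternative is a directory path relative to the project root and that "satisfied" means the directory exists and is non-empty. The helper name `requires_satisfied` is hypothetical.

```python
from pathlib import Path


def requires_satisfied(alternatives: list[str], root: Path) -> bool:
    """Return True if at least one REQUIRES alternative is satisfied.

    Assumption: an alternative counts as satisfied when it is a directory
    under `root` that contains at least one entry.
    """
    for rel in alternatives:
        path = root / rel
        if path.is_dir() and any(path.iterdir()):
            return True
    return False
```

For the strategist entry, `alternatives` would hold the literature and data-assessment paths with `{project}` already substituted.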
Adding an agent
Adding a new agent is a one-file change. Create the agent file in .claude/agents/, add its entry to permissions.md. No other file needs modification — the Orchestrator discovers agents by reading the registry.
The Orchestrator Loop
After a plan is approved, the Orchestrator takes over autonomously. It runs a 5-step loop until quality gates are met or limits are reached.
Step 1: IDENTIFY
The Orchestrator reads permissions.md and checks which agents have their REQUIRES satisfied. Agents activate when dependencies are met, not by phase number. If you already have data and a draft paper, you can enter at Strategy or Peer Review — the Orchestrator checks REQUIRES, not sequence.
Step 2: DISPATCH
Worker-critic pairs launch for all agents whose REQUIRES are satisfied. Agents in the same PARALLEL_GROUP run concurrently; for example, the librarian and explorer both belong to the discovery group and run simultaneously. Before dispatch, the Orchestrator runs PRE-validation (see Lifecycle Validation).
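One way to picture Steps 1 and 2 together is to filter the registry for ready agents and bucket them by PARALLEL_GROUP. The sketch below works on an already-parsed registry; the dictionary shape, the `requires_any` key, and the idea that the discovery agents have no prerequisites are assumptions made for illustration. Actual concurrent execution is not shown.

```python
from collections import defaultdict


def ready_groups(registry: dict[str, dict], available: set[str]) -> dict[str, list[str]]:
    """Group agents whose REQUIRES are met by their PARALLEL_GROUP.

    `registry` maps agent name -> {"requires_any": [...], "parallel_group": str}
    (a hypothetical parsed form of permissions.md). An empty `requires_any`
    list means the agent has no prerequisites.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for agent, entry in registry.items():
        requires = entry["requires_any"]
        if not requires or any(item in available for item in requires):
            groups[entry["parallel_group"]].append(agent)
    return dict(groups)
```

Each returned list is a set of agents that could be dispatched side by side.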
Step 3: REVIEW
The paired critic evaluates the worker’s artifact against its rubric. If the score is below 80, the worker revises and the critic re-reviews. This inner loop runs up to 3 rounds per pair (see Worker-Critic Protocol).
Step 4: VERIFY
Compile LaTeX, run scripts, check that output files exist at expected paths. If verification fails, the worker fixes and re-verifies (max 2 attempts).
Step 5: SCORE
Compute the weighted aggregate across all scored components. If the score meets the gate threshold, present the summary to the user. If not, loop back to Step 1 to identify the blocking components.
Limits
| Limit | Value | What happens |
|---|---|---|
| Worker-critic rounds | 3 | Escalate per ESCALATION_TARGET |
| Overall loop rounds | 5 | Present with remaining issues |
| Verification retries | 2 | Report failure to user |
The system never loops indefinitely.
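The overall-round limit can be sketched as a bounded loop. Here `score_round` is a hypothetical callable standing in for one full pass through Steps 1 to 5, and the status strings are illustrative only.

```python
from typing import Callable

MAX_LOOP_ROUNDS = 5  # "Overall loop rounds" from the limits table


def run_outer_loop(
    score_round: Callable[[int], float],
    threshold: float,
    max_rounds: int = MAX_LOOP_ROUNDS,
) -> dict:
    """Repeat the loop until the aggregate meets the threshold or rounds run out."""
    score = None
    for round_number in range(1, max_rounds + 1):
        score = score_round(round_number)
        if score >= threshold:
            return {"status": "passed", "rounds": round_number, "score": score}
    # Limit reached: present the summary together with the remaining issues.
    return {"status": "presented_with_issues", "rounds": max_rounds, "score": score}
```

Because the loop is a `for` over a fixed range, termination does not depend on the scores ever improving.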
Modes
Pipeline mode (/new-project): Full orchestration with dependency graph, parallel dispatch, and quality gates.
Standalone mode (/review, /write, /analyze, etc.): Skip dependency checks, dispatch the requested agent(s) directly, return results.
Simplified mode (R scripts, explorations): Plan, implement, run, verify quality >= 80. No multi-agent reviews.
Lifecycle Validation
The Orchestrator runs validation checks at two points: before dispatching an agent and after an agent completes. This implements a fail-fast principle — missing inputs are caught before work begins, missing outputs are caught before advancement.
PRE-dispatch checks
Before dispatching any agent, the Orchestrator reads its permissions.md entry and verifies that REQUIRES artifacts exist (file paths globbed, score gates checked against the research journal, directories confirmed non-empty) and that required sections are present in structured artifacts.
If PRE validation fails, the agent is not dispatched.
POST-completion checks
After an agent completes, the Orchestrator verifies that PRODUCES artifacts exist, required sections are present, and the critic score is recorded in the research journal.
If POST validation fails, the pipeline does not advance. The Orchestrator re-dispatches the agent with the specific gaps noted.
What failure looks like
Cannot dispatch [coder]: missing quality_reports/strategy/{project}/strategy_memo.md
→ Run /strategize first to produce the strategy memo.
No agent launches with missing inputs. No agent advances with missing outputs. The Orchestrator includes the validation failure in its report to the user.
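The required-sections part of these checks could look roughly like the following. It assumes that a "section" is a Markdown heading whose text matches the required name case-insensitively; the registry does not state how sections are detected, so treat this as one possible reading. `missing_sections` is a hypothetical helper.

```python
import re

# Matches Markdown ATX headings such as "## Estimand".
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown: str, required: list[str]) -> list[str]:
    """Return the required section names that have no matching heading."""
    present = {match.group(1).lower() for match in HEADING.finditer(markdown)}
    return [name for name in required if name.lower() not in present]
```

An empty result means the artifact passes; a non-empty result lists the specific gaps to include when the agent is re-dispatched.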
Worker-Critic Protocol
Every agent that creates an artifact has a paired critic that evaluates it. For the full list of pairs and what each reviews, see Agents.
The 3-round loop
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#f0eee6', 'primaryTextColor': '#141413', 'primaryBorderColor': '#d1cfc5', 'lineColor': '#d97757', 'secondaryColor': '#faf9f5', 'tertiaryColor': '#e6e3da', 'edgeLabelBackground': '#faf9f5', 'clusterBkg': '#f0eee6', 'clusterBorder': '#d1cfc5'}}}%%
flowchart TD
W[Worker creates artifact]:::worker --> C[Critic reviews cold-read]:::critic
C -->|score >= 80| A([APPROVED]):::approved
C -->|score < 80| F[Worker fixes issues]:::worker
F --> C2[Critic re-reviews cold-read]:::critic
C2 -->|score >= 80| A
C2 -->|score < 80, round < 3| F
C2 -->|score < 80, round = 3| ESC([ESCALATE]):::escalate
classDef worker fill:#f0eee6,stroke:#d97757,color:#141413,stroke-width:2px
classDef critic fill:#f0eee6,stroke:#788c5d,color:#141413,stroke-width:2px
classDef approved fill:#f0eee6,stroke:#4d7c5a,color:#141413,stroke-width:2px
classDef escalate fill:#f0eee6,stroke:#b33d3d,color:#141413,stroke-width:2px
Cold-read enforcement
On every round, the critic receives only: the artifact, its scoring rubric, the severity level, and the content invariants from Rules & Invariants. It does not receive what round this is, what the worker struggled with, prior critic reports, or the research journal. This prevents the critic from softening its evaluation because it knows the worker has already tried twice.
Round-aware feedback (what to fix and how) flows from the Orchestrator to the worker, never through the critics.
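Cold-read enforcement amounts to an allowlist on what reaches the critic. A minimal sketch, assuming the Orchestrator holds its context as a dictionary; the class and key names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticInput:
    """The only four things a critic receives on any round."""
    artifact: str
    rubric: str
    severity: str
    invariants: str


def build_critic_input(context: dict) -> CriticInput:
    """Copy the four allowed fields; round numbers, prior reports,
    and journal entries in `context` are never forwarded."""
    return CriticInput(
        artifact=context["artifact"],
        rubric=context["rubric"],
        severity=context["severity"],
        invariants=context["invariants"],
    )
```

Building the input by explicit allowlist, rather than deleting known-bad keys, means any new field added to the context stays hidden from the critic by default.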
Separation of powers
Critics never create artifacts. Creators never self-score. If a critic invocation produces a file in scripts/, paper/, or paper/talks/, the Orchestrator flags it as a violation. If a creator reports its own score, the Orchestrator discards it and dispatches the critic.
A critic that fixes its own findings has an incentive to find only fixable issues. Separation keeps criticism honest.
Escalation routing
When a pair hits 3 rounds without converging, the Orchestrator reads the agent’s ESCALATION_TARGET from permissions.md. Some examples from the registry:
| Pair | Escalation Target | Rationale |
|---|---|---|
| coder + coder-critic | strategist-critic | Strategy memo may not be implementable |
| writer + writer-critic | Orchestrator | Needs structural rewrite, not polish |
| strategist + strategist-critic | User | Fundamental design question |
| theorist + theorist-critic | User | Proof-level disagreement |
Post-escalation, the worker starts fresh from the escalation target’s decision — not from its previous attempt.
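The 3-round loop and its escalation exit can be sketched as below. The `create`, `revise`, and `review` callables are hypothetical stand-ins for the worker and critic; note that `review` receives only the artifact, never the round number, in keeping with the cold-read rule.

```python
from typing import Callable


def run_pair(
    create: Callable[[], str],
    revise: Callable[[str, list[str]], str],
    review: Callable[[str], tuple[int, list[str]]],
    escalation_target: str,
    pass_score: int = 80,
    max_rounds: int = 3,
) -> dict:
    """Run one worker-critic pair until approval or escalation."""
    artifact = create()
    score = 0
    for round_number in range(1, max_rounds + 1):
        score, issues = review(artifact)  # critic sees the artifact only
        if score >= pass_score:
            return {"status": "approved", "rounds": round_number, "score": score}
        if round_number < max_rounds:
            artifact = revise(artifact, issues)
    return {"status": "escalated", "target": escalation_target,
            "rounds": max_rounds, "score": score}
```

The `escalation_target` argument would be the ESCALATION_TARGET value read from the agent's registry entry.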
Dual-Critic Dispatch
For artifacts that gate major phase transitions, the Orchestrator dispatches two critics with different evaluation lenses. This catches blind spots that a single-dimension review misses.
Gate artifacts
| Artifact | Primary Critic | Secondary Lens |
|---|---|---|
| Strategy memo | strategist-critic (identification) | coder-critic (implementability) |
| Main results (code) | coder-critic (code quality) | strategist-critic (strategy alignment) |
| Final manuscript | writer-critic (prose quality) | strategist-critic (claims-strategy match) |
Synthesis rules
- Both >= 80: Pass. Artifact advances.
- One < 80: Worker fixes issues from the failing critic. Both critics re-evaluate.
- Both < 80: Escalate per the pair’s ESCALATION_TARGET.
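The three synthesis rules reduce to counting how many critics scored below 80. A small sketch, with hypothetical outcome labels:

```python
def synthesize(primary: int, secondary: int, pass_score: int = 80) -> str:
    """Map two independent critic scores to a dual-critic outcome."""
    failing = sum(score < pass_score for score in (primary, secondary))
    # 0 failing -> advance, 1 failing -> fix and re-review, 2 failing -> escalate
    return ("advance", "fix_and_rereview", "escalate")[failing]
```

For example, `synthesize(85, 79)` returns `"fix_and_rereview"`: the worker addresses the failing critic's issues and both critics evaluate again.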
Independence
Neither critic sees the other’s report. The Orchestrator synthesizes both scores independently. This is the same cold-read principle applied across critics, not just across rounds.
Scope
Dual-critic dispatch activates only for gate artifacts in the full pipeline (/new-project). Standalone skill invocations (/review, /write, etc.) use single critics.
Pipeline State and Session Recovery
The state file
The Orchestrator maintains structured pipeline state in quality_reports/pipeline_state.json. The schema tracks:
{
"project": "project-name",
"phase": "Execution",
"last_updated": "2026-05-09T14:30:00Z",
"agents_completed": [
{ "agent": "strategist", "critic": "strategist-critic",
"score": 88, "rounds": 1, "artifact": "strategy_memo.md" }
],
"agents_in_progress": [
{ "agent": "coder", "critic": "coder-critic",
"current_round": 2, "max_rounds": 3,
"last_score": 72, "issues_remaining": ["missing robustness"] }
],
"agents_pending": ["writer"],
"overall_score": null,
"blocked_by": null
}
Write triggers
The Orchestrator updates the state file at four points:
- After agent completion: move the entry from agents_in_progress to agents_completed
- After critic score: update the score and round count
- After phase transition: update the phase field
- After escalation: set blocked_by
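The first write trigger could be implemented along these lines. The sketch operates on the state as an in-memory dictionary using the field names from the schema above; the function name `mark_completed` is hypothetical, and writing the result back to quality_reports/pipeline_state.json is left out.

```python
from datetime import datetime, timezone


def mark_completed(state: dict, agent: str, score: int, artifact: str) -> dict:
    """Move `agent` from agents_in_progress to agents_completed."""
    entry = next(
        (e for e in state["agents_in_progress"] if e["agent"] == agent), None
    )
    if entry is None:
        raise KeyError(f"{agent} is not in progress")
    state["agents_in_progress"].remove(entry)
    state["agents_completed"].append({
        "agent": agent,
        "critic": entry["critic"],
        "score": score,
        "rounds": entry["current_round"],
        "artifact": artifact,
    })
    # Same timestamp shape as the schema example (UTC, trailing "Z").
    state["last_updated"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return state
```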
Read triggers
On session start (new session or after context compression), the Orchestrator reads pipeline_state.json as the first action. The structured state is faster and more reliable than reconstructing from prose logs.
Session recovery order
| Step | Action | Source |
|---|---|---|
| 0 | Read pipeline state | quality_reports/pipeline_state.json |
| 1 | Read checkpoint artifacts | SESSION_REPORT.md, quality_reports/research_journal.md |
| 2 | Read project config + plan | CLAUDE.md, most recent file in quality_reports/plans/ |
| 3 | Check recent changes | git log --oneline -10, git diff |
Execution traces
After completing a multi-agent skill (/new-project, /analyze, /write full), the Orchestrator generates an execution trace from the pipeline state and saves it to quality_reports/traces/. The trace includes a mermaid graph of the agent invocation sequence, a table of rounds and scores per agent, any escalation events, and a pipeline summary with totals.
Relationship to the research journal
pipeline_state.json is structured, machine-readable, and overwritten. research_journal.md is narrative, append-only, and human-readable. Both are maintained — the state file is for fast recovery, the journal is for understanding what happened and why.
Scoring Protocol
Weighted aggregation
Each agent’s quality weight is declared in its permissions.md entry. The Orchestrator reads the registry and computes:
\[\text{Overall} = \sum_i w_i \times \text{score}_i\]
where \(w_i\) is the agent’s QUALITY_WEIGHT and \(\text{score}_i\) is the critic’s final score (0–100). Current weights:
| Component | Weight | Source |
|---|---|---|
| Literature coverage | 10% | librarian-critic |
| Data quality | 10% | explorer-critic |
| Identification validity | 25% | strategist-critic |
| Theory (when present) | 20% | theorist-critic |
| Code quality | 15% | coder-critic |
| Paper quality | 25% | Avg(domain-referee, methods-referee) |
| Manuscript polish | 10% | writer-critic |
| Replication readiness | 5% | verifier (0 or 100) |
Missing components
If a component has not been scored, it is excluded and remaining weights renormalize. The theory weight applies only when the theorist agent was dispatched (papers with formal theory sections). For applied papers using off-the-shelf estimators, theory is excluded automatically.
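A sketch of the aggregation with renormalization, using the weights from the table. Dividing by the sum of the weights actually used handles both cases: the table's weights total 100 without the theory row and 120 with it. The function name is hypothetical.

```python
WEIGHTS = {
    "Literature coverage": 10,
    "Data quality": 10,
    "Identification validity": 25,
    "Theory": 20,
    "Code quality": 15,
    "Paper quality": 25,
    "Manuscript polish": 10,
    "Replication readiness": 5,
}


def overall_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted mean over the scored components only.

    Components missing from `scores` are excluded, and the remaining
    weights are renormalized by dividing by their sum.
    """
    if not scores:
        raise ValueError("no components have been scored")
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * value for name, value in scores.items()) / total_weight
```

With only identification (88) and code (72) scored, the result is (25 × 88 + 15 × 72) / 40 = 82.0.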
Gate thresholds
| Gate | Overall Score | Per-Component Min | Consequence |
|---|---|---|---|
| Commit | >= 80 | None | Allowed |
| PR | >= 90 | None | Allowed |
| Submission | >= 95 | >= 80 | Allowed |
| Below 80 | < 80 | — | Blocked |
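Read as a decision procedure, the gate table could be sketched as follows. The function name and the choice to return the highest gate reached are assumptions.

```python
def highest_gate(overall: float, component_scores: list[float]) -> str:
    """Return the highest gate the scores clear, or "Blocked"."""
    # Submission is the only gate with a per-component minimum.
    if overall >= 95 and all(score >= 80 for score in component_scores):
        return "Submission"
    if overall >= 90:
        return "PR"
    if overall >= 80:
        return "Commit"
    return "Blocked"
```

An overall score of 96 with one component at 79 clears the PR gate but not Submission.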
Severity gradient
Critics calibrate harshness by phase. The Orchestrator includes the severity level in the critic’s prompt. For the full deduction scaling table, see Rules & Invariants.
| Phase | Severity | Stance |
|---|---|---|
| Discovery | Low | Encouraging — early ideas need space |
| Strategy | Medium | Constructive — sound identification, but suggest alternatives |
| Execution | High | Strict — bugs are costly at this stage |
| Peer Review | Maximum | Adversarial — simulates real referees |
| Presentation | Medium-high | Professional — polished but advisory |
Plan-First Protocol
For any non-trivial task, the system enters plan mode before implementation.
The protocol
- Enter plan mode
- Check memory: read [LEARN] entries relevant to the task
- Requirements spec (for complex/ambiguous tasks): clarify ambiguities, mark requirements as MUST / SHOULD / MAY, declare clarity status as CLEAR / ASSUMED / BLOCKED
- Draft the plan: what changes, which files, in what order
- Save to disk: quality_reports/plans/YYYY-MM-DD_description.md
- Present to user: wait for approval
- Exit plan mode: only after approval
- Save session log: capture goal and context while fresh
- Implement: Orchestrator takes over via the dependency-driven loop
Plans survive context compression because they are saved to disk. The requirements spec step is optional — skip it for clear, specific tasks; use it for ambiguous or multi-file work.
Simplified mode
For standalone R scripts, simulations, and explorations: plan, implement, run, verify quality >= 80. No multi-agent reviews.
“Just do it” mode
When the user says “just do it” or “handle it”: skip the final approval pause, auto-commit if score >= 80, but still run the full verify-review-fix loop.
Learning Loop
After completing a multi-agent skill, the Orchestrator reviews the execution trace for three pattern types.
Pattern detection
| Pattern | Signal | Example |
|---|---|---|
| HIGH-PERF | Agent-critic pair scored >= 90 on first pass | Strategist produced a clean memo that passed on round 1 |
| FRICTION | Pair hit 3 strikes | Coder-critic kept failing on numerical discipline; prompt may need refinement |
| ESCALATION | Question escalated to user | Could the system have resolved it with better context or rules? |
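The three signals can be sketched as a tagging function over one pair's trace record. The record shape is an assumption: `rounds` and `score` mirror the state file, `escalated_to` is a hypothetical field, and "3 strikes" is read here as reaching round 3 with a final score still below 80.

```python
def classify_patterns(pair: dict) -> list[str]:
    """Tag one worker-critic pair's trace record with learning patterns."""
    tags = []
    if pair["rounds"] == 1 and pair["score"] >= 90:
        tags.append("HIGH-PERF")
    if pair["rounds"] >= 3 and pair["score"] < 80:
        tags.append("FRICTION")
    if pair.get("escalated_to") == "User":
        tags.append("ESCALATION")
    return tags
```

A pair can carry more than one tag, for instance when three failed rounds end in an escalation to the user.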
How it works
The Orchestrator surfaces findings as “Suggested Learnings” at the end of the pipeline summary:
### Suggested Learnings
- [HIGH-PERF] Strategist memo structure matched coder needs ---
consider templating the robustness section
- [FRICTION] Coder failed numerical discipline 3 rounds ---
add float-comparison examples to coding standards
- [ESCALATION] User resolved data-source ambiguity ---
add data source preferences to domain profile
The Orchestrator never auto-appends to memory. It presents suggestions. The user approves or rejects. Approved learnings are saved via the auto-memory system.
Learning promotion
When a pattern appears across 3 or more projects, it is a candidate for promotion to a template or rule file — moving from project-specific memory to the system’s permanent configuration. This is how the architecture improves over time without accumulating noise.
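The promotion rule is a count of distinct projects per pattern. A minimal sketch, assuming approved learnings are available as (project, pattern) pairs; how they are actually stored is not specified here.

```python
from collections import defaultdict
from typing import Iterable


def promotion_candidates(
    learnings: Iterable[tuple[str, str]], min_projects: int = 3
) -> list[str]:
    """Return patterns seen in at least `min_projects` distinct projects."""
    projects_by_pattern: dict[str, set[str]] = defaultdict(set)
    for project, pattern in learnings:
        projects_by_pattern[pattern].add(project)
    return sorted(
        pattern
        for pattern, projects in projects_by_pattern.items()
        if len(projects) >= min_projects
    )
```

Using a set per pattern means a learning repeated many times inside one project still counts once.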