Finally, a Referee Report You'll Actually Read

Hugo Sant'Anna
The problem with agent output

Multi-agent systems produce a lot of text. Strategy memos, code audits, referee reports, literature reviews — each one lands as a markdown file in a folder you’ll never open again.

This is the dirty secret of agentic workflows: the agents do the work, but the output sits unread because markdown dumped into quality_reports/ doesn’t invite engagement. You ran /review --peer and got three referee reports. You should be reading them carefully, pushing back on the methods referee, deciding which domain comments to address first. Instead, they’re in a folder. You’ll get to them later. You won’t.

Thariq Shihipar diagnosed this precisely in his “HTML Effectiveness” post. His argument: self-contained HTML files are the most effective format for sharing analysis — no server, no dependencies, no “can you install X to view this.” Just a file that opens in a browser and looks good. He demonstrated it with a single page that’s more persuasive than any slide deck about the same idea.

I took that seriously.

What I built

Starting in v4.3.0, every clo-author skill that produces a quality report also generates a self-contained interactive HTML file. No server. No npm. No build step. One .html file with embedded CSS and JavaScript that opens in any browser, works offline, and prints cleanly.

There are six report types. Here are five of them, live — click around.

Peer review

Three referee reports — editorial decision, domain referee, methods referee — in a tabbed interface. The editorial tab shows the verdict and key conditions. Click through to each referee for the full report with severity-coded comment cards. This is what /review --peer generates after the markdown.

Quality gate

The submission readiness dashboard. A gauge for the aggregate score, colored bars for each gate threshold (commit at 80, PR at 90, submission at 95), a component grid showing every agent’s score, and blocking/passing cards that tell you exactly what’s holding you back. Generated by /submit final.
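The gate arithmetic is simple enough to sketch. The thresholds below are the real ones from the dashboard; the function wrapping them is mine, for illustration only:

# Real thresholds, invented function -- not the script's internals.
GATES = {"commit": 80, "pr": 90, "submission": 95}

def gates_passed(score):
    return {gate: score >= cutoff for gate, cutoff in GATES.items()}

gates_passed(87)   # {'commit': True, 'pr': False, 'submission': False}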

Literature

A filterable, searchable annotated bibliography. Filter by category (theoretical, empirical, methodological), by proximity to your paper (core, adjacent, peripheral), by method (DiD, IV, RDD, structural). Sort by year, author, relevance. Search across titles and annotations. Copy a formatted citation with one click. This is what /discover lit produces — a self-contained Zotero that lives in your project folder.
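Under the hood, each bibliography entry is just a record carrying those filter fields. A sketch of the idea in Python; the field names are my guesses, not clo-author's actual schema:

entries = [
    {"title": "Example Paper A", "year": 2021, "category": "empirical",
     "proximity": "core", "method": "DiD"},
    {"title": "Example Paper B", "year": 2015, "category": "theoretical",
     "proximity": "peripheral", "method": "IV"},
]

def filter_entries(entries, **wanted):
    # Keep entries that match every requested field.
    return [e for e in entries if all(e.get(k) == v for k, v in wanted.items())]

filter_entries(entries, category="empirical", proximity="core")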

Strategy review

An accordion interface that groups identification issues by phase — assumptions, specification, robustness, threats. Each issue gets a severity card (critical, major, minor) with the strategist-critic’s assessment. From /review --methods and /strategize.

Code audit

Score breakdown as a radial gauge, paper-to-code traceability map, sanity check cards, robustness progress tracker. Produced by /review --code and /analyze.

Project dashboard

A single page showing your entire research project: which paper sections exist, what data you have, what scripts have been written, what quality scores look like, what the last few commits changed. Run /tools dashboard or /checkpoint and it appears. (Not embedded here — it scans a live project directory.)

The design

Every report uses the same design system: templates/html/base/styles.css and components.js. The palette is ivory, clay, slate, and oat — warm and readable, not the clinical blue-and-white of a framework default. Serif headings, monospace labels, generous whitespace. Dark mode that respects prefers-color-scheme and remembers your choice.

The components are deliberately simple. Tabs, collapsibles, severity cards, filter chips, copy buttons. No framework. No virtual DOM. No bundler. The JavaScript is 80 lines of vanilla querySelectorAll and classList.toggle. It loads instantly and will still work in ten years.

I borrowed Thariq’s conviction that HTML is underused as a delivery format. Not as a web app. Not as a dashboard that needs a server. As a file you email, drop in Slack, or open from Finder. A referee report that opens in Safari and looks like someone designed it is more likely to get read than a .md file that opens in a text editor.

How it works

Two Python scripts do all the generation:

scripts/generate_dashboard.py scans your project structure — CLAUDE.md, paper/sections/, data/, scripts/, quality_reports/, Bibliography_base.bib, git log — and produces a single HTML file. No dependencies beyond the standard library.
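In outline, the scan is a few pathlib walks plus one git call. A stdlib-only sketch, not the actual script; the helper and the dictionary keys are mine:

import subprocess
from pathlib import Path

def scan_project(root="."):
    root = Path(root)

    def names(rel):
        d = root / rel
        return sorted(p.name for p in d.iterdir()) if d.is_dir() else []

    log = subprocess.run(["git", "log", "--oneline", "-5"],
                         capture_output=True, text=True, cwd=root)
    return {
        "sections": names("paper/sections"),
        "data": names("data"),
        "scripts": names("scripts"),
        "reports": names("quality_reports"),
        "has_bib": (root / "Bibliography_base.bib").exists(),
        "recent_commits": log.stdout.splitlines(),
    }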

scripts/generate_html_report.py takes markdown report files as input and produces one of the five detail report types. Each subcommand knows the structure of its input markdown (where the scores are, where the comments start, how issues are classified) and transforms it into the appropriate interactive layout.
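The parsing is ordinary regex-over-markdown. A sketch of the shape with a made-up layout, since the real markdown conventions belong to the skills:

import re

def parse_referee_report(md):
    # Made-up patterns: assumes a "Score: NN" line and bullets tagged
    # like "- [major] ...". The actual layout is clo-author's, not this.
    score = re.search(r"Score:\s*(\d+)", md)
    comments = re.findall(r"^[-*]\s*\[(critical|major|minor)\]\s*(.+)$",
                          md, re.MULTILINE)
    return {
        "score": int(score.group(1)) if score else None,
        "comments": [{"severity": s, "text": t} for s, t in comments],
    }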

Both scripts inline the shared CSS and JS, so the output is truly self-contained. You can move the HTML file anywhere — email it, put it on a USB drive, open it in 2035 — and it works.
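The inlining step is plain string assembly. Roughly this, where the two asset paths are the real ones and the page skeleton is simplified:

from pathlib import Path

def render_self_contained(body_html):
    # Embed the shared assets so the file has zero external references.
    css = Path("templates/html/base/styles.css").read_text()
    js = Path("templates/html/base/components.js").read_text()
    return ("<!doctype html><html><head><meta charset='utf-8'>"
            f"<style>{css}</style></head>"
            f"<body>{body_html}<script>{js}</script></body></html>")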

The skills handle the wiring. When you run /review --peer, the skill dispatches the domain referee and methods referee, collects their markdown reports, runs the editor agent, saves everything to quality_reports/reviews/, and then calls generate_html_report.py peer-review with the three markdown files. The HTML appears next to the markdown. You don’t think about it.
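That final call is the same one you can run by hand (see Try it below). Inside the skill it is conceptually just a subprocess step; the review file names here assume the quality_reports/reviews/ layout described above:

import subprocess

subprocess.run([
    "python3", "scripts/generate_html_report.py", "peer-review",
    "quality_reports/reviews/domain_referee_report.md",
    "quality_reports/reviews/methods_referee_report.md",
    "quality_reports/reviews/editor_decision.md",
], check=True)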

Why this matters

The gap between “agent did the work” and “human engaged with the output” is where most multi-agent systems lose their value. You can have the most sophisticated identification critic in the world, but if its output is a markdown file that gets skimmed once and forgotten, you’ve built an expensive rubber stamp.

Interactive HTML closes that gap. The peer review report with tabs makes you read all three referees. The quality gate with a gauge makes you feel whether you’re at 82 or 94. The literature report with filters makes you actually explore the bibliography instead of treating it as a checkbox.

This isn’t a new idea. Thariq’s post made the case better than I could. I just applied it to the specific problem of making agent output readable — and wired it into every skill so it happens automatically.

The reports are the interface between the multi-agent system and the researcher. They should look like it.

Try it

# Generate a project dashboard
python3 scripts/generate_dashboard.py --output project_dashboard.html

# Generate from demo data (ships with clo-author)
python3 scripts/generate_html_report.py peer-review \
  quality_reports/demo/domain_referee_report.md \
  quality_reports/demo/methods_referee_report.md \
  quality_reports/demo/editor_decision.md

# Or just run a skill; the HTML is automatic
# /review --peer
# /discover lit
# /tools dashboard

The demo files in quality_reports/demo/ let you see all six report types without running the full pipeline. They ship with every clo-author install.