The problem
A coding agent that grades its own work declares done early. Not maliciously — the loop just doesn't have a reason to keep going once the visible diff "looks right." Per-file code smells slip in. Scope drifts a file or two beyond what was asked. A UI change "looks fine" without anyone having clicked the button. The model is doing its best, but its best is one judgment call per turn, and that call is the same one that just made the changes.
The fix isn't to write a better agent. The fix is a second pair of eyes the agent can't bypass, at the two moments drift happens: the instant a file is written (line-level drift), and the instant the turn ends (delivery-level drift). Catch both and the agent stays in lock step with the standards the repo has agreed on, rather than drifting away by degrees over a long session.
That's what this hook system does. It's three bash scripts wired through .claude/settings.json against three Claude Code lifecycle events. Each script is small. The composition is what makes the loop tight.
This article is about the per-write and per-turn review hooks in this portfolio repo. For the related pattern of using hooks to inject long-term memory into prompts, see memory-pkg. For the graph-backed map of a codebase that the agent queries instead of re-reading, see sylphie-pkg.
What the system is
Three hooks, three lifecycle events, one shared state directory under .claude/. The whole wiring fits in 40 lines of settings.json (.claude/settings.json:1-40):
- snapshot-baseline.sh runs on every UserPromptSubmit. It hashes the current code-file diff and writes the hash to .claude/.turn-baseline-hash. Cheap (10 s timeout, settings.json:35), silent, no LLM. Its only job is to mark where this turn started. (.claude/hooks/snapshot-baseline.sh)
- review-on-write.sh runs on every PostToolUse whose tool matches Write|Edit|MultiEdit. It spawns a Sonnet reviewer with a single instruction: read this one file and decide APPROVED or ISSUES. Silent on APPROVED; injects ISSUES back into the parent session as additionalContext. 90 s timeout (settings.json:24). (.claude/hooks/review-on-write.sh)
- review-on-stop.sh runs on every Stop, i.e., the moment the agent thinks it's done. It compares the current diff hash against the baseline. If this turn produced no code changes, the hook exits silently. Otherwise it spawns a Sonnet reviewer with tools (git, the session transcript, Playwright) and verifies the cumulative diff actually delivers what the user asked for. 300 s timeout (settings.json:9). (.claude/hooks/review-on-stop.sh)
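The baseline step is small enough to sketch in full. What follows is a minimal illustration, not the contents of snapshot-baseline.sh: the function names (current_diff, hash_stdin, snapshot_baseline) are hypothetical, and hashing `git diff HEAD` plus the names of untracked files is an assumption about what "the current code-file diff" means.

```bash
# Sketch of a UserPromptSubmit baseline hook. All function names are hypothetical.

# Print the working-tree state to be hashed: tracked changes plus the
# names of untracked files. (Assumption: the real hook may also filter
# by file extension.)
current_diff() {
  git diff HEAD 2>/dev/null
  git ls-files --others --exclude-standard 2>/dev/null
}

# Reduce stdin to a bare SHA-1 hex digest.
hash_stdin() {
  sha1sum | cut -d' ' -f1
}

# Record where this turn started.
snapshot_baseline() {
  local state_dir="$1"
  mkdir -p "$state_dir"
  current_diff | hash_stdin > "$state_dir/.turn-baseline-hash"
}

# Usage from the hook entry point:
#   snapshot_baseline .claude
```

Because the hook only writes a hash and never spawns a model, it can run on every prompt without a noticeable cost.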
Two files give the reviewers continuity across their stateless claude -p invocations: .claude/review-overrides.md (committed; intentional choices the reviewer should not re-flag) and .claude/review-log/YYYY-MM-DD.md (gitignored; auto-appended record of every review).
Fail-safe, not fail-open
Compare these review hooks with the injection hook described in the memory-pkg write-up. That hook fails open. If the retrieval CLI errors, if the DB is down, if anything goes wrong, the hook silently exits and the user's prompt proceeds untouched. Memory enrichment is a nice-to-have; not blocking the user is the priority.
The review hooks fail safe. If the reviewer fails to spawn, the failure is captured and emitted as the review output: [reviewer failed: ...]. (review-on-write.sh:71, review-on-stop.sh:148) The user sees that something went wrong instead of silently shipping unreviewed work. The hash that records "we already reviewed this diff" is persisted only after a successful emit, so a failed run re-tries instead of getting marked done.
Different jobs, different defaults. Inject-style hooks should fail open because their value is additive. Review-style hooks should fail safe because their value is gating. Every choice in the rest of this article — what happens when the reviewer times out, when the regex misses a file, when the transcript can't be found — falls out of that axis.
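The fail-safe behavior comes down to two small moves: turn a reviewer failure into review output, and write the "already reviewed" hash only after the output has been emitted. The sketch below illustrates both under stated assumptions. The function names are hypothetical, the reviewer command is passed in as arguments so nothing is assumed about its invocation, and the exact text inside the brackets is illustrative.

```bash
# Sketch of the fail-safe pattern. Function names are hypothetical.

# Run a reviewer command; on failure (or empty output), return the
# failure itself as the review text instead of staying silent.
capture_review() {
  local out status
  out=$("$@" 2>&1)
  status=$?
  if [ "$status" -ne 0 ] || [ -z "$out" ]; then
    out="[reviewer failed: exit $status: $out]"
  fi
  printf '%s\n' "$out"
}

# Emit the review, and record the diff hash only if the emit succeeded,
# so a failed run is retried on the next Stop rather than marked done.
emit_then_persist() {
  local review="$1" hash="$2" hash_file="$3"
  if printf '%s\n' "$review"; then
    printf '%s\n' "$hash" > "$hash_file"
  fi
}
```

A fail-open hook would do the opposite in `capture_review`: swallow the error and exit 0 with no output.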
Architecture at a glance
┌─ UserPromptSubmit ────────────────────────────────────────────┐
│ snapshot-baseline.sh → hash(diff) → .turn-baseline-hash │
└──────────────────────────┬────────────────────────────────────┘
▼
┌─ PostToolUse (Write|Edit|MultiEdit), once per write ──────────┐
│ review-on-write.sh → Sonnet reviewer on THIS FILE ONLY │
│ APPROVED: silent │
│ ISSUES: additionalContext → next turn │
└──────────────────────────┬────────────────────────────────────┘
▼
┌─ Stop ────────────────────────────────────────────────────────┐
│ review-on-stop.sh │
│ ├─ if hash == baseline → exit (no code changed this turn) │
│ ├─ if hash == last-reviewed → exit (already reviewed) │
│ └─ else: Sonnet reviewer with git + transcript + Playwright │
│ APPROVED / ISSUES → systemMessage to parent │
└───────────────────────────────────────────────────────────────┘
─────────────────────────────────────────────────────────────────
state under .claude/:
.turn-baseline-hash reset every prompt
.last-review-hash persists across turns
review-overrides.md reviewer memory, committed
review-log/*.md audit trail, gitignored
Each hook also writes every review to .claude/review-log/YYYY-MM-DD.md regardless of outcome, and reads .claude/review-overrides.md before evaluating. Both reviewers run with a re-entry guard (CLAUDE_REVIEW_HOOK_RUNNING=1) so the claude -p they spawn doesn't re-fire the same hook recursively.
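A re-entry guard of this kind can be sketched in a few lines. The environment variable name is the one the hooks use; the function names are hypothetical, and the commented usage is an assumption about how a hook script might call them.

```bash
# Sketch of a re-entry guard. Function names are hypothetical.

# True when this process was spawned from inside a review hook.
is_review_child() {
  [ "${CLAUDE_REVIEW_HOOK_RUNNING:-}" = "1" ]
}

# Run a command with the marker set, so any hook that fires inside
# that command sees the marker and bails out immediately.
run_with_guard() {
  env CLAUDE_REVIEW_HOOK_RUNNING=1 "$@"
}

# At the top of a hook script (illustrative):
#   is_review_child && exit 0
#   review=$(run_with_guard claude -p "$prompt")
```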
The per-write reviewer
Every Write, Edit, or MultiEdit triggers a Sonnet review of that one file. The prompt is short and pointed (review-on-write.sh:35-67):
Review THIS FILE ONLY — do not audit the rest of the repo, do not run tests, do not read other files unless strictly necessary to interpret a symbol in this one. Focus on: best-practice violations, code smells, premature abstractions, dead code, scope creep, half-finished pieces, obvious bugs, security issues. Be terse. Skip nitpicks.
The output contract is two-state: APPROVED — <one-line summary> or ISSUES\n- .... The hook is silent on approval and emits the issues as additionalContext only when the reviewer found something. (review-on-write.sh:82-87) That's the lock-step move: the parent session sees the review feedback on its very next turn, not at the end of the session, not after a manual /review command, not after the user notices. The reviewer's word becomes part of the parent's context exactly when the parent is about to make its next decision.
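The two-state contract maps onto a short branch in the hook. The sketch below is an illustration only: the function names are hypothetical, the JSON envelope is an assumption about the hook output shape rather than a copy of review-on-write.sh, and the escaping is done with sed and awk here purely to keep the example dependency-free.

```bash
# Sketch of the two-state output contract. Function names are hypothetical.

# Escape backslashes, double quotes, and newlines for a JSON string.
# (Simplification: tabs and other control characters are not handled.)
json_escape() {
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# Silent on APPROVED; anything else is handed back as additionalContext.
emit_write_review() {
  local review="$1"
  case "$review" in
    APPROVED*) return 0 ;;
  esac
  printf '{"hookSpecificOutput":{"hookEventName":"PostToolUse","additionalContext":"%s"}}\n' \
    "$(json_escape "$review")"
}
```

Treating "anything that is not APPROVED" as reportable means a `[reviewer failed: ...]` message is surfaced too, which matches the fail-safe default.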
Three operational details matter for portability and they're all in service of the fail-safe property: a 22-extension regex to skip configs and docs (review-on-write.sh:23-26), a CLAUDE_REVIEW_HOOK_RUNNING=1 env marker to prevent recursion when the reviewer itself writes files (review-on-write.sh:9-11, 70), and node -e substituting for jq because Git Bash on Windows often lacks it.
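The extension gate is the cheapest of the three to illustrate. The list below is a stand-in chosen for the example, not the 22-extension list from review-on-write.sh, and `is_code_file` is a hypothetical name.

```bash
# Sketch of an extension gate. The extension list is illustrative only.

# Succeed when the path looks like a code file worth reviewing.
is_code_file() {
  printf '%s\n' "$1" | grep -Eqi '\.(ts|tsx|js|jsx|py|go|rs|java|sh)$'
}

# In the hook: skip docs and configs before spending a reviewer call.
#   is_code_file "$file_path" || exit 0
```

Anchoring the pattern at the end of the path keeps `package.json` from matching the `js` entry.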
The per-turn reviewer
When the agent thinks it's done, Stop fires and review-on-stop.sh runs. The first thing it does is decide whether to run at all — the two-hash gate described in the architecture box above. Only when both checks fail does the Sonnet reviewer spawn, with a longer brief and a real tool budget. A condensed sketch of the prompt (review-on-stop.sh:74-142):
You are a post-work reviewer for a Claude Code session that just stopped in this repository. Catch drift before it ships. Be terse.
<overrides>
{review-overrides.md verbatim}
</overrides>
Do this in order:
1. Run git diff HEAD and git ls-files --others --exclude-standard — see exactly what changed.
2. Find the session transcript under ~/.claude/projects/ by session id (UUID match in the .jsonl filename). Read it to recover the user's original prompt(s) and what work was actually done.
3. Verify, in order, stopping early on the first fail: intent fidelity, cross-file drift, Playwright run, manual MCP browser pass.
4. Report exactly one of: APPROVED — <one-line summary> or ISSUES\n- <bullet>.... No preamble, no sign-off.
The structural contrast with the per-write prompt is the design tradeoff made explicit: the per-write reviewer is tight and focused — one file, no tools beyond Read, short prompt, two-state output. The Stop reviewer is wide and tooled — git access, transcript access, Playwright access, a four-step verification checklist. Different vantage, different job, different shape.
Step 3 is where the actual review happens. Expanded:
- Intent fidelity. Does the cumulative diff deliver what the user asked for?
- Cross-file drift. Unrelated files touched? Scope creep across the diff as a whole? (Per-file code smells are explicitly out of scope — the PostToolUse reviewer already covered them.)
- Playwright verification. If playwright.config.* exists, run npx playwright test and fail on any failure. If Playwright isn't configured, fail with a specific message: verification missing is itself a failure, not a pass.
- Manual MCP browser pass. Grep the transcript for mcp__playwright__browser_* tool calls. Require at least one browser_navigate, one browser_take_screenshot, and one browser_console_messages. Scripted tests can't see the experience; the agent must have driven a browser by hand.
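The last two checks are mechanical enough to express as plain shell. In the hooks described here the reviewer model carries them out itself from the prompt, so the functions below are a hypothetical equivalent, written only to make the pass/fail conditions concrete.

```bash
# Sketch of the Playwright and MCP checks as shell. Function names are
# hypothetical; the tool names are the ones listed in the checklist.

# Is there a Playwright config in the given directory?
has_playwright_config() {
  local dir="${1:-.}" f
  for f in "$dir"/playwright.config.*; do
    [ -e "$f" ] && return 0
  done
  return 1
}

# Does the transcript mention all three required MCP browser tools?
has_manual_browser_pass() {
  local transcript="$1" tool
  for tool in browser_navigate browser_take_screenshot browser_console_messages; do
    grep -q "mcp__playwright__${tool}" "$transcript" || return 1
  done
}
```

Both functions return non-zero when evidence is absent, which mirrors the rule that missing verification counts as a failure.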
Why intent fidelity carries the most weight
Most automated review systems check the diff against itself: lint, format, tests, type-check. The Stop reviewer checks the diff against the original request. It reads the transcript, recovers what the user prompted for at the start of the turn, and asks whether the cumulative work actually delivers that. That's the move a human reviewer makes — and the one most reviewer bots skip, because it requires understanding intent, not just code.
The structural reason the reviewer can do this is that Claude Code persists every turn's full transcript to ~/.claude/projects/<sanitized-cwd>/<session-id>.jsonl. This is a stable path in Claude Code's filesystem layout — the same one the memory-pkg capture hook reads from. The reviewer's prompt teaches it how to find that file from a session id alone (review-on-stop.sh:94-99) — sanitized cwd as the directory, session id as the filename. From there, recovering the user's original prompt is one read. The reviewer is the second pair of eyes that the writer is structurally unable to be: it sees both the ask and the result, with the ask treated as the ground truth rather than something to be re-interpreted post hoc.
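Locating the transcript from a session id alone can be sketched as a filename search. `find_transcript` is a hypothetical helper. The sketch deliberately searches by filename rather than reconstructing the sanitized-cwd directory name, since the sanitization rule is not spelled out here.

```bash
# Sketch of locating a transcript by session id. find_transcript is hypothetical.

# Print the first <session-id>.jsonl found under the projects root.
# The root defaults to ~/.claude/projects and can be overridden.
find_transcript() {
  local session_id="$1" root="${2:-$HOME/.claude/projects}"
  find "$root" -type f -name "${session_id}.jsonl" 2>/dev/null | head -n 1
}

# Usage (illustrative):
#   transcript=$(find_transcript "$session_id")
#   [ -n "$transcript" ] || echo "transcript not found"
```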
The other two design choices in the checklist matter more than they look. The Stop reviewer is told not to redo the per-file review work — that's already been done synchronously, write by write. And "Playwright not configured" doesn't produce a warning; it's a fail. Verification missing is treated as identical to verification failed. That's what keeps the lock-step honest — the agent can't ship a UI change and self-attest that it looks good.
The hook emits the result back to the parent session as a systemMessage JSON payload (review-on-stop.sh:164). The hash is persisted to .last-review-hash only after the emit succeeds — otherwise a failed emit would mark the diff as already-reviewed and the user would never see the feedback. This is the fail-safe axis from the section above, applied at the level of "what if the emit itself dies."
Idempotency through two hash files
.turn-baseline-hash and .last-review-hash look like incidental config detail — two text files holding a SHA-1 each. They're load-bearing: together they make the Stop reviewer naturally idempotent over identical diffs.
Consider what happens without them. Stop fires every time the agent yields control, which can be several times per task: after a code change, after an exploratory answer with no change, after re-running a failed command. A naive reviewer would either (a) re-run on every Stop and burn money on duplicate reviews, or (b) refuse to run more than once per session and miss real changes.
The two-hash gate threads the needle. .turn-baseline-hash, reset on every user prompt, answers "did this turn produce code changes?" — if the diff hasn't changed since the user spoke, exit. .last-review-hash, persisted across turns, answers "have we already reviewed exactly this diff?" — if so, exit. Between them, the reviewer runs exactly once per unique diff state, regardless of how many times Stop fires.
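The gate itself reduces to two string comparisons. The sketch below assumes the current diff hash has already been computed; `should_review` is a hypothetical name, while the two file names are the ones the hooks use.

```bash
# Sketch of the two-hash gate. should_review is a hypothetical helper.

# Succeed (exit 0) only when the current diff hash differs from both the
# turn baseline and the last reviewed hash. A missing file never matches
# a non-empty hash.
should_review() {
  local current="$1" state_dir="$2" baseline="" last=""
  if [ -f "$state_dir/.turn-baseline-hash" ]; then
    baseline=$(cat "$state_dir/.turn-baseline-hash")
  fi
  if [ -f "$state_dir/.last-review-hash" ]; then
    last=$(cat "$state_dir/.last-review-hash")
  fi
  [ "$current" != "$baseline" ] && [ "$current" != "$last" ]
}

# Usage (illustrative):
#   should_review "$current_hash" .claude || exit 0
```

Once a review has been emitted for a hash, writing that hash to .last-review-hash makes every later Stop over the same diff a no-op.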
This is the kind of small distributed-systems thinking that scales beyond review hooks. Any "fires on event, expensive to run, sometimes redundant" hook benefits from the same pair of gates: a turn-scoped baseline plus a long-lived seen-set, both keyed by a content hash of whatever the hook actually operates on.
Reviewer memory — overrides and the log
The reviewers are stateless. Each claude -p call gets a fresh context: no memory of yesterday's reviews, no memory of which choices have already been argued through and accepted. Without something to bridge that gap, the same false-positive surfaces every single time — the user spends their day arguing with a reviewer that doesn't learn.
Two files form the bridge.
.claude/review-overrides.md (committed) is a curated list of intentional choices the reviewer should not re-flag. Both hooks include it verbatim inside an <overrides> block in the reviewer prompt. (review-on-write.sh:35-50, review-on-stop.sh:74-87) Entries are one line: the pattern, an em-dash, the reason it's intentional. Sample entries from this repo:
- `set -uo pipefail` without `-e` — intentional. The scripts use `|| true`
patterns and tolerate specific subshell failures; `-e` would abort on those.
- `mapfile` builtin — bash 5.2 is confirmed. Portability flags about bash 3.x
(macOS default) do not apply.
- `sha1sum` — present in Git Bash on Windows via MSYS coreutils. Portability
flags about macOS (`shasum -a 1`) do not apply.
These are exactly the kinds of "well actually" notes that would otherwise burn a clarification round on every review. With the override file in front of it, the reviewer simply omits them, and if it would have flagged only override-covered items, it returns APPROVED.
The override file is itself a piece of project memory the user maintains. When the user rejects a reviewer flag, they add a line. The friction of writing the line is intentional: if the override needs more than a one-line reason, the override probably isn't sound and the flag deserves a second look. (.claude/review-overrides.md:46-49)
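Splicing the override file into a prompt is a matter of string assembly. In the sketch below, `build_prompt` and its arguments are hypothetical; only the `<overrides>` wrapper and the verbatim inclusion come from the description above.

```bash
# Sketch of splicing the overrides file into a reviewer prompt.
# build_prompt is a hypothetical helper.

build_prompt() {
  local instructions="$1" overrides_file="$2" overrides=""
  # A missing overrides file yields an empty block rather than an error.
  if [ -f "$overrides_file" ]; then
    overrides=$(cat "$overrides_file")
  fi
  printf '%s\n<overrides>\n%s\n</overrides>\n' "$instructions" "$overrides"
}

# Usage (illustrative):
#   prompt=$(build_prompt "Review THIS FILE ONLY" .claude/review-overrides.md)
```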
.claude/review-log/YYYY-MM-DD.md (gitignored) is the audit trail. Every review — APPROVED or ISSUES, per-write or per-turn — gets appended with a UTC timestamp. (review-on-write.sh:73-80, review-on-stop.sh:150-157) The reviewer doesn't read it; the user does. It's how you skim what's been flagged today, spot patterns across reviews, and decide whether a new override entry is warranted.
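An append of this kind might look like the sketch below. `append_review_log` and the entry layout (a heading line followed by the review text) are hypothetical; only the per-day file name and the UTC timestamp come from the description above.

```bash
# Sketch of the audit-trail append. append_review_log and the entry
# layout are hypothetical.

append_review_log() {
  local log_dir="$1" hook="$2" review="$3" day stamp
  day=$(date -u +%Y-%m-%d)
  stamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  mkdir -p "$log_dir"
  # Append-only: APPROVED and ISSUES entries are both recorded.
  printf '## %s %s\n%s\n\n' "$stamp" "$hook" "$review" >> "$log_dir/$day.md"
}

# Usage (illustrative):
#   append_review_log .claude/review-log per-write "$review"
```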
The split is deliberate: overrides are curated (consumed by the reviewer, written by the human), the log is automatic (consumed by the human, written by the hooks). One is reviewer memory; the other is user memory about the reviewer.
What this costs
Compute. One Sonnet invocation per file write, plus at most one Stop review per code-changing turn. The per-write reviewer is tightly scoped (one file, no full-repo audit, no other-file reads unless necessary to resolve a symbol) and the prompt is short, so each call is cheap. The Stop reviewer is heavier; it runs git, reads the transcript, runs npx playwright test, and inspects MCP transcript entries. Each Stop review is capped by the 300 s timeout in settings.json.
Latency. Per-write review fires in the background after each Write/Edit. Its issues land on the agent's next turn — so the cost is mostly invisible during normal work, only felt as the agent occasionally acknowledging an ISSUES block before continuing. Stop review is foreground — the user sees a Sonnet reviewing completed work… status while it runs. The two-hash gate makes it free on non-code turns and on repeat Stops over an unchanged diff.
Wall time saved. Every issue caught here is one that doesn't surface after a commit, a push, a PR review, or — worst case — a regression in production. The per-write reviewer in particular catches things the parent session is biased not to see, because the parent session is the one that just wrote them. A second reviewer with a single-file vantage is structurally better-positioned to spot drift than the writer is.
Token cost in the parent session. The injected additionalContext and systemMessage payloads consume parent-session tokens. Empirically these are short — one-line summaries on APPROVED, a few bullets on ISSUES. The trade is paying a small token tax per turn to keep the agent from declaring done prematurely. The ratio favors the tax by a wide margin.
What this doesn't do (yet)
- Cross-turn pattern learning. The reviewer reads review-overrides.md but never reads review-log/. Patterns of repeated false positives across days don't compound into anything except the human noticing and writing an override.
- Auto-generated override proposals. When the user rejects a flag, the override has to be written by hand. A nice extension would be a small skill that proposes an override entry based on the rejected flag's text.
- Multi-reviewer voting. One Sonnet pass per gate. A more paranoid setup could run two reviewers and require agreement before injecting issues — overkill for the current scope.
- Cost tracking. The review log records output but not token usage. A wrapper that captured tokens-in/tokens-out per review would let the cost story stop being a hand-wave.
Why it matters
A solo agent is a writer with no editor. It gets a lot done. The second pass — the one that turns "done" into actually done — is what humans wanted from a junior engineer and rarely got. These hooks are that second pass: automated, synchronous, gated. The whole system is three short bash scripts and two markdown files. Nothing here is clever; the leverage is in where it fires, not what it does.
Code map
| Component | Path |
|---|---|
| UserPromptSubmit baseline hook | .claude/hooks/snapshot-baseline.sh |
| PostToolUse per-write reviewer | .claude/hooks/review-on-write.sh |
| Stop per-turn reviewer | .claude/hooks/review-on-stop.sh |
| Hook wiring | .claude/settings.json |
| Reviewer overrides (committed) | .claude/review-overrides.md |
| Review log (gitignored) | .claude/review-log/YYYY-MM-DD.md |
| Per-turn baseline hash | .claude/.turn-baseline-hash |
| Last reviewed diff hash | .claude/.last-review-hash |
| Project standards the hooks enforce | CLAUDE.md (Definition of Done section) |