Case study · Recursive integrity · Layer 3

Layer 3: applying the doctrine to claude-agent-sdk-demos

AIDB's cohort-bridge auto-bundled the email-agent SDK glue layer of anthropics/claude-agent-sdk-demos and submitted it to architect.validate. The doctrine that runs inside Claude Code via MCP is now evaluating Anthropic's own demonstration of how Claude Code's substrate is built. The validator produced mechanism-specific findings on code that, by Anthropic's own framing, was never optimised for production governance.

Key Facts

Verdict: 22/F · high_risk · draft
Production blockers: 7 of 10 principles
Highest finding: P8 Approvals · sev 95/100
Methodology: Single-pass scan · commit-pinned

Layer-3 scan on claude-agent-sdk-demos

The receipt is real, not a demo. The badge below links to the persisted readiness review with the full per-principle verdict, severity scores, and the validator's evidence prose. The scan is pinned at commit 826b2685, so it is reproducible from the public source.

run_id 056929ab-c0d5-40c8-be74-5d816128a389 · commit 826b2685

View full readiness review →

What was scanned, and what wasn't

Repository: `anthropics/claude-agent-sdk-demos` at commit `826b2685…`. The bridge selected eight files from `email-agent/ccsdk/*`: the SDK glue layer that wires Claude Agent SDK into the demo's listener/action managers, custom tools, UI state, and message queue. Total bundle: 53 KB.

Anthropic's README states the demos are for local development, not production deployment. The scan applies AIDB's 10-principle doctrine to the reference code at face value. Findings are mechanism-specific: they name the function, the tool list, the missing checkpoint. They are not critiques of Anthropic's engineering. They are examples of what the doctrine reveals when applied to code that was never written to satisfy production-governance requirements.

Commit pinned at scan time: 826b268506a5f3707623c9e6140b200befcbebae

Three layers, three audits

1. Layer 1: the validator scores its own source. Cert-confirmed self-audit, 14 prior rounds.
2. Layer 2: the validator scores a governed agent. Canary baseline 100/A clean, 18/F injected.
3. Layer 3: the validator scores its substrate. This case study. AIDB's doctrine runs on top of the Anthropic Claude Agent SDK; the doctrine now grades the demo of how to build on it.

Verdict

The validator's headline finding on the bundle is that mailbox-mutating tools reach the agent context without an explicit approval break, in a substrate that ships with `Bash`, `Edit`, and `Write` enabled by default. The score and severities are the receipts: 22/F · high_risk · draft. Seven production blockers across P1, P2, P5, P7, P8, P9, P10. Three hardening_recommended findings on P3, P4, P6. One principle (P8) carried a high_risk verdict at severity 95: the only principle in the run where the validator escalated beyond needs_changes.

The validator's code-classification labels the bundle an `autonomous_agentic_workflow`: "The code is not a simple synchronous API component: AIClient.queryStream runs multi-turn Claude Agent SDK sessions with tool access, ListenersManager.checkEvent executes event-triggered listener handlers, and managers load executable custom scripts from agent/custom_scripts. The presence of background listeners, actions, message queues, tool calls, and email mutation helpers makes this an autonomous agentic workflow."

The P8 finding, verbatim

P8 (Approvals & blockers) is the only principle with a high_risk verdict and the highest severity in the run. The text is the validator's, not the case-study author's:

HIGH_RISK · 95/100
Approval and blocker handling are the critical failing boundary. Automatic listener handlers receive `archiveEmail`, `starEmail`, `markAsRead`, and `addLabel` in `createContext()` and can mutate mailbox state without a hard `awaiting_approval` break. `AIClient.defaultOptions.allowedTools` permits `Bash`, `Edit`, `Write`, `WebFetch`, and `Task`; the `PreToolUse` hook only blocks `.js`/`.ts` writes outside `agent/custom_scripts`, so other writes, shell commands, and network actions can proceed. `custom-tools.ts` injects untrusted email bodies into the agent context, creating a prompt-injection path from external email content to powerful tools. Blockers are mostly `console.error` or text error responses, and `MessageQueue.close()` can leave pending consumers unresolved.
architect.validate · P8 verdict on anthropics/claude-agent-sdk-demos · high_risk · sev 95 · 056929ab…
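
One way to picture the missing `awaiting_approval` break is a gate that sits between a listener and the mailbox helpers. The sketch below is illustrative only: `ApprovalGate`, `PendingApproval`, and the `apply` callback are hypothetical names that do not appear in the demo repository, and the real helpers would take richer arguments than an email ID.

```typescript
// Hypothetical sketch: none of these types come from the demo repository.
type Mutation = "archiveEmail" | "starEmail" | "markAsRead" | "addLabel";

interface PendingApproval {
  id: number;
  mutation: Mutation;
  emailId: string;
  status: "awaiting_approval" | "approved" | "rejected";
}

class ApprovalGate {
  private nextId = 1;
  private readonly pending = new Map<number, PendingApproval>();

  // `apply` stands in for the real mailbox helper; it runs only after approval.
  constructor(private readonly apply: (mutation: Mutation, emailId: string) => void) {}

  // Listeners call this instead of the real helper; nothing is mutated yet.
  request(mutation: Mutation, emailId: string): PendingApproval {
    const entry: PendingApproval = {
      id: this.nextId++,
      mutation,
      emailId,
      status: "awaiting_approval",
    };
    this.pending.set(entry.id, entry);
    return entry;
  }

  // A reviewer resolves the request; only an approval triggers the mutation.
  resolve(id: number, approved: boolean): PendingApproval {
    const entry = this.pending.get(id);
    if (!entry) throw new Error(`unknown approval id ${id}`);
    if (entry.status !== "awaiting_approval") {
      throw new Error(`approval ${id} was already resolved`);
    }
    entry.status = approved ? "approved" : "rejected";
    if (approved) this.apply(entry.mutation, entry.emailId);
    return entry;
  }
}

// Usage: two requests are queued, one is approved and one is rejected.
const applied: string[] = [];
const gate = new ApprovalGate((mutation, emailId) => {
  applied.push(`${mutation}:${emailId}`);
});
const archiveRequest = gate.request("archiveEmail", "msg-1");
const labelRequest = gate.request("addLabel", "msg-2");
gate.resolve(archiveRequest.id, true);
gate.resolve(labelRequest.id, false);
```

The design choice here is that the listener never holds a reference to the mutating function, so an injected instruction in an email body can at most enqueue a request.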

Production blockers

Seven principles received a `production_blocker` severity class. The validator named the function, the data path, and the missing checkpoint in each. Severities are the validator's; rationale prose is the validator's; this case study reorders by severity for readability.

P8 · Make hand-offs, approvals, and blockers explicit

HIGH_RISK · 95/100

Automatic listener handlers receive `archiveEmail`, `starEmail`, `markAsRead`, and `addLabel` in `createContext()` and can mutate mailbox state without a hard `awaiting_approval` break. `AIClient.defaultOptions.allowedTools` permits `Bash`, `Edit`, `Write`, `WebFetch`, and `Task`; the `PreToolUse` hook only blocks `.js`/`.ts` writes outside `agent/custom_scripts`. `custom-tools.ts` injects untrusted email bodies into the agent context, creating a prompt-injection path from external email content to powerful tools.
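
The gap between "block `.js`/`.ts` writes outside one directory" and "deny by default" can be sketched as a small policy function. This is a minimal illustration under stated assumptions: `decide`, `ToolRequest`, and `Decision` are hypothetical and are not the Claude Agent SDK's hook signature, and treating `WebSearch` as a pass-through tool is an arbitrary choice for the example.

```typescript
// Hypothetical policy function; it is not the SDK's hook signature.
type Decision = { allow: true } | { allow: false; reason: string };

interface ToolRequest {
  tool: string;
  path?: string; // target path for file-writing tools, if any
}

const WRITABLE_ROOT = "agent/custom_scripts/";
const WRITE_TOOLS = new Set(["Edit", "Write"]);
const PASS_THROUGH_TOOLS = new Set(["WebSearch"]); // assumption for the example

// Collapse "." and ".." segments so a path cannot climb out of the root.
function normalise(path: string): string {
  const out: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      out.pop();
      continue;
    }
    out.push(segment);
  }
  return out.join("/");
}

function decide(req: ToolRequest): Decision {
  if (WRITE_TOOLS.has(req.tool)) {
    if (!req.path) return { allow: false, reason: "write without a target path" };
    if (req.path.startsWith("/")) return { allow: false, reason: "absolute path" };
    return normalise(req.path).startsWith(WRITABLE_ROOT)
      ? { allow: true }
      : { allow: false, reason: `write outside ${WRITABLE_ROOT}` };
  }
  if (PASS_THROUGH_TOOLS.has(req.tool)) return { allow: true };
  // Anything not explicitly listed (Bash, WebFetch, Task, ...) is denied.
  return { allow: false, reason: `tool ${req.tool} is not on the allow-list` };
}

// Usage: every extension is covered, not only .js/.ts.
const bash = decide({ tool: "Bash" });
const markdownWrite = decide({ tool: "Write", path: "notes.md" });
const scriptWrite = decide({ tool: "Write", path: "agent/custom_scripts/rule.ts" });
const traversal = decide({ tool: "Edit", path: "agent/custom_scripts/../secrets.env" });
```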

P10 · Optimise for steering, not only initiating

NEEDS_CHANGES · 75/100

`AIClient.queryStream` passes `maxTurns` and `resume`, but there is no abort controller, pause/resume command, dynamic constraint update, or side-effect checkpoint in this layer. `ListenersManager.checkEvent` runs matching listeners to completion once triggered; email mutation helpers do not check for cancellation or changed policy before acting. `ActionsManager.executeAction` executes a handler once and returns a result. Retry, rollback, reprioritisation, and mid-run correction are not represented.
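
A side-effect checkpoint of the kind this finding describes could look roughly like the following. It is a sketch, not the demo's code: `RunControl`, `labelAll`, and the `afterStep` callback are hypothetical, and the actual mutation is reduced to a recorded outcome so the example stays self-contained.

```typescript
// Hypothetical steering handle; the demo has no equivalent in this layer.
type Checkpoint = "proceed" | "cancelled" | "blocked_by_policy";

class RunControl {
  private cancelled = false;
  private readonly blockedLabels = new Set<string>();

  cancel(): void {
    this.cancelled = true;
  }

  // Models a policy change that arrives while the run is in progress.
  blockLabel(label: string): void {
    this.blockedLabels.add(label);
  }

  // Called immediately before each side effect.
  checkpoint(label: string): Checkpoint {
    if (this.cancelled) return "cancelled";
    if (this.blockedLabels.has(label)) return "blocked_by_policy";
    return "proceed";
  }
}

interface StepResult {
  emailId: string;
  outcome: "applied" | "cancelled" | "blocked_by_policy";
}

function labelAll(
  emailIds: string[],
  label: string,
  control: RunControl,
  afterStep: (index: number) => void,
): StepResult[] {
  const results: StepResult[] = [];
  for (let i = 0; i < emailIds.length; i++) {
    const emailId = emailIds[i];
    const verdict = control.checkpoint(label);
    if (verdict !== "proceed") {
      // Record why the run stopped, then stop before any further mutation.
      results.push({ emailId, outcome: verdict });
      break;
    }
    results.push({ emailId, outcome: "applied" }); // the real mutation would go here
    afterStep(i);
  }
  return results;
}

// Usage: the operator cancels after the first email has been labelled.
const control = new RunControl();
const results = labelAll(["a", "b", "c"], "newsletter", control, (index) => {
  if (index === 0) control.cancel();
});
```

Because the check runs before every mutation rather than once at the start, a cancellation or policy change takes effect at the next step instead of after the listener has run to completion.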

P7 · Establish trust through inspectability

NEEDS_CHANGES · 70/100

There is no persistent `run_id` correlating `AIClient.queryStream` messages, listener invocations, email mutations, UI state updates, and action executions. `callAgent()` does not record the prompt, model, schema, response, or tool-use rationale; listener log writes are fire-and-forget; JSONL files are not tamper-evident. `custom-tools.ts` writes full formatted email results, including `body`, to local log files: traceable, but creating a sensitive-data audit surface without governance.
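
To make "persistent `run_id`" and "tamper-evident JSONL" concrete, here is a minimal hash-chain sketch. It assumes a Node.js runtime for `node:crypto`; `AuditEntry`, `append`, `verify`, and `toJsonl` are hypothetical names, and events carry identifiers rather than email bodies so the log does not become a sensitive-data store.

```typescript
import { createHash } from "node:crypto";

// Hypothetical audit record; field names are illustrative.
interface AuditEntry {
  runId: string;
  seq: number;
  event: string;
  prevHash: string;
  hash: string;
}

const GENESIS = "0".repeat(64);

function hashEntry(runId: string, seq: number, event: string, prevHash: string): string {
  return createHash("sha256")
    .update(JSON.stringify([runId, seq, event, prevHash]))
    .digest("hex");
}

// Each entry commits to the previous entry's hash, forming a chain.
function append(log: AuditEntry[], runId: string, event: string): AuditEntry[] {
  const prevHash = log.length > 0 ? log[log.length - 1].hash : GENESIS;
  const seq = log.length;
  return [...log, { runId, seq, event, prevHash, hash: hashEntry(runId, seq, event, prevHash) }];
}

// Recompute every hash; any edited, reordered, or removed entry breaks the chain.
function verify(log: AuditEntry[]): boolean {
  let prevHash = GENESIS;
  let expectedSeq = 0;
  for (const entry of log) {
    const recomputed = hashEntry(entry.runId, entry.seq, entry.event, entry.prevHash);
    if (entry.seq !== expectedSeq || entry.prevHash !== prevHash || entry.hash !== recomputed) {
      return false;
    }
    prevHash = entry.hash;
    expectedSeq++;
  }
  return true;
}

// One JSON object per line, matching the JSONL convention.
function toJsonl(log: AuditEntry[]): string {
  return log.map((entry) => JSON.stringify(entry)).join("\n");
}

// Usage: one run_id correlates a listener invocation and the mutation it caused.
let auditLog: AuditEntry[] = [];
auditLog = append(auditLog, "run-001", "listener:invoked");
auditLog = append(auditLog, "run-001", "email:archiveEmail msg-1");
const tampered = auditLog.map((entry) =>
  entry.seq === 0 ? { ...entry, event: "listener:skipped" } : entry,
);
```

A chain like this detects edits only if the latest hash is also stored somewhere the writer cannot change; that anchoring step is outside the sketch.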

P1 · Design for delegation rather than direct manipulation

NEEDS_CHANGES · 65/100

`createContext()` hands every enabled listener helpers such as `archiveEmail`, `starEmail`, `markAsRead`, `addLabel`, `callAgent`, and `uiState.set` without a persistent authority envelope, per-listener permission scope, or lifecycle controls beyond `config.enabled`. `AIClient.defaultOptions.allowedTools` grants broad tools including `Task`, `Bash`, `Edit`, `Write`, `WebFetch`, and `WebSearch` without tying them to user-stated constraints.
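
A per-listener permission scope might be sketched as below. The names `createScopedContext`, `Capability`, and `Helpers` are hypothetical, the helper signatures are simplified to a single string argument, and the real `createContext()` is not reproduced.

```typescript
// Hypothetical capability names mirroring the helpers listed in the finding.
type Capability = "archiveEmail" | "starEmail" | "markAsRead" | "addLabel" | "callAgent";

type Helpers = Record<Capability, (arg: string) => string>;

// Build a context in which only granted helpers are live; the rest fail loudly.
function createScopedContext(
  all: Helpers,
  listenerId: string,
  granted: readonly Capability[],
): Helpers {
  const allowed = new Set<Capability>(granted);
  const deny = (capability: Capability) => (_arg: string): string => {
    throw new Error(`listener ${listenerId} lacks capability ${capability}`);
  };
  const scoped = {} as Helpers;
  for (const capability of Object.keys(all) as Capability[]) {
    scoped[capability] = allowed.has(capability) ? all[capability] : deny(capability);
  }
  return scoped;
}

// Usage: a labelling listener is granted addLabel and nothing else.
const calls: string[] = [];
const stub = (name: Capability) => (arg: string): string => {
  calls.push(`${name}:${arg}`);
  return "ok";
};
const allHelpers: Helpers = {
  archiveEmail: stub("archiveEmail"),
  starEmail: stub("starEmail"),
  markAsRead: stub("markAsRead"),
  addLabel: stub("addLabel"),
  callAgent: stub("callAgent"),
};
const ctx = createScopedContext(allHelpers, "newsletter-tagger", ["addLabel"]);
ctx.addLabel("msg-1");
let denied = "";
try {
  ctx.archiveEmail("msg-1");
} catch (error) {
  denied = (error as Error).message;
}
```

Throwing on an ungranted call, instead of omitting the helper, turns a missing permission into an explicit blocker that can be logged and surfaced.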

P5 · Replace implied magic with clear mental models

NEEDS_CHANGES · 60/100

The prompt in `EMAIL_AGENT_PROMPT` distinguishes `Listeners = Automatic/event-triggered` from `Actions = User-triggered/on-demand`, but the actual runtime capabilities are much broader. `AIClient.defaultOptions.allowedTools` includes `Bash`, `Edit`, `Write`, `Task`, `WebFetch`, and `WebSearch`; `custom-tools.ts` returns untrusted email `body` content into the agent context; and listener handlers receive email mutation functions without an explicit permissions explanation.

P2 · Ensure that background work remains perceptible

NEEDS_CHANGES · 60/100

`ListenersManager` writes a `ListenerLogEntry` after a handler finishes and optionally calls `logBroadcastCallback`, while `ActionsManager.logExecution` appends JSONL logs, but there is no persistent run record showing `queued`, `active`, `blocked`, `awaiting approval`, or `failed` while work is in progress. `this.logWriter.appendLog(...).catch(...)` is fire-and-forget, so audit/status failures are not part of task state. `MessageQueue.close()` sets `closed = true` and drops `resolvers` without resolving pending `next()` calls, which can strand waiters silently.
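
The stranded-waiter problem can be illustrated with a small queue whose `close()` settles every pending `next()` call. This is a sketch of the general technique, not a patch to the demo: `DrainingQueue` is a hypothetical class, and the demo's `MessageQueue` internals are known here only from the validator's description.

```typescript
// Hypothetical queue; the demo's MessageQueue is not reproduced here.
class DrainingQueue<T> {
  private items: T[] = [];
  private resolvers: Array<(result: IteratorResult<T>) => void> = [];
  private closed = false;

  push(item: T): void {
    if (this.closed) throw new Error("queue is closed");
    const waiter = this.resolvers.shift();
    if (waiter) {
      waiter({ value: item, done: false });
    } else {
      this.items.push(item);
    }
  }

  next(): Promise<IteratorResult<T>> {
    if (this.items.length > 0) {
      return Promise.resolve({ value: this.items.shift() as T, done: false });
    }
    if (this.closed) {
      return Promise.resolve({ value: undefined, done: true });
    }
    return new Promise<IteratorResult<T>>((resolve) => {
      this.resolvers.push(resolve);
    });
  }

  // Settle every pending next() with done: true instead of dropping it.
  // Returns how many waiters were released so the caller can log it.
  close(): number {
    this.closed = true;
    const waiters = this.resolvers;
    this.resolvers = [];
    for (const waiter of waiters) {
      waiter({ value: undefined, done: true });
    }
    return waiters.length;
  }
}

// Usage: a consumer is waiting when the queue closes.
const queue = new DrainingQueue<string>();
let settled = "pending";
queue.next().then((result) => {
  settled = result.done ? "done" : "value";
});
const released = queue.close();
```

Returning the released count gives the caller something to record in task state, which addresses the "silently" part of the finding as well as the hang.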

P9 · Represent delegated work as a system, not merely as a conversation

NEEDS_CHANGES · 60/100

`AIClient.queryStream` yields SDK messages, `ActionsManager.instances` and `ComponentManager.instances` are in-memory maps, listener invocations are per-event loops, and logs are separate JSONL files. There is no run graph, task timeline, dependency model, or shared orchestration record tying an agent session to listener-created actions, component state, email reads, and email mutations.
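
As a rough picture of what a "shared orchestration record" could be, the sketch below keeps one parent-linked graph per run. `RunGraph`, `RunNode`, and the node kinds are hypothetical; a production version would also need persistence, timestamps, and status fields.

```typescript
// Hypothetical node kinds, chosen to match the surfaces named in the finding.
type NodeKind = "agent_session" | "listener" | "action" | "email_read" | "email_mutation";

interface RunNode {
  id: string;
  kind: NodeKind;
  parent: string | null;
  seq: number; // insertion order, standing in for a timeline position
}

class RunGraph {
  private readonly nodes = new Map<string, RunNode>();
  private counter = 0;

  constructor(readonly runId: string) {}

  // Register a unit of work and link it to whatever caused it.
  add(kind: NodeKind, parent: string | null = null): string {
    if (parent !== null && !this.nodes.has(parent)) {
      throw new Error(`unknown parent ${parent}`);
    }
    const seq = this.counter++;
    const id = `${this.runId}/${seq}`;
    this.nodes.set(id, { id, kind, parent, seq });
    return id;
  }

  // Walk from a node to the root: "which session caused this mutation?"
  lineage(id: string): NodeKind[] {
    const path: NodeKind[] = [];
    let current: string | null = id;
    while (current !== null) {
      const node: RunNode | undefined = this.nodes.get(current);
      if (!node) throw new Error(`unknown node ${current}`);
      path.push(node.kind);
      current = node.parent;
    }
    return path.reverse();
  }
}

// Usage: an agent session triggers a listener, which archives an email.
const graph = new RunGraph("run-001");
const session = graph.add("agent_session");
const listener = graph.add("listener", session);
const mutation = graph.add("email_mutation", listener);
const lineage = graph.lineage(mutation);
```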

Why this matters operationally

The doctrine that grades cohort applicants is the same doctrine that runs inside Claude Code via MCP. When that doctrine is applied to a reference codebase published by the substrate vendor, the findings the validator surfaces are not generic best-practice prose. They are mechanism-specific: `createContext()` hands out mailbox-mutation helpers; the allowed-tools list includes `Bash`/`Edit`/`Write`; `PreToolUse` only blocks one extension class; email bodies flow into the agent context unfiltered.

These findings are useful in two directions. For an applicant building on top of the Claude Agent SDK, the findings name the boundaries that need explicit governance before the demo pattern reaches production. For AIDB, the layer-3 scan demonstrates that the doctrine surfaces real mechanisms, not platitudes, when applied to code outside its own surface.

Anthropic's framing is preserved. The validator's findings are preserved. Both are true. A practitioner adapting this pattern toward production needs both.

What the layer-3 scan establishes

Three recursive-integrity layers now have public receipts: validator-on-validator, validator-on-governed-agent, validator-on-substrate. Same 10-principle rubric, same mechanism-specific engagement, three code surfaces, three verdicts. The receipts are the artifact.

On the comparative series

This is the first in a planned series applying AIDB's doctrine to major agent-SDK substrates. Subsequent case studies will follow the same methodology and bridge selector, with each scan publishing its own run_id and scope statement. Comparative analyses across vendors are published separately.

Three case studies, three facets

This is the substrate-validation facet of the recursive-integrity triad. The other two facets are public: the self-validation scan on the cohort-bridge orchestrator that selects what to feed the validator, and the reference-honesty scan on our own A2A protocol example. Same doctrine, three code surfaces, three different verdicts.

Bridge self-audit → · A2A reference agent →

Receipts

Replayable from the run_id: /readiness-review/056929ab…. Full per-principle reasoning, severities, recommendations, and code-classification rationale are persisted server-side. The 10-principle doctrine fingerprint is the same fingerprint that scored every prior layer-1 and layer-2 run.

Recurrence policy

This scan reflects the substrate at commit 826b2685 on 2026-05-12. As Anthropic updates `claude-agent-sdk-demos`, the findings here become historical rather than current. Future rescans, if AIDB conducts them, would publish as separate case studies with their own run_ids and scope statements.
