Case study · Recursive integrity · Reference honesty
A2A reference agent: what the validator finds in our own example
The validator that grades cohort applicants was run against the Blueprint Governed Agent, our own A2A reference example published to a2aproject/a2a-samples. The verdict is 58/D, draft, with four production blockers. The case study names exactly which findings are deliberate scope (the example is a protocol demonstration, not production infrastructure) and which were small enough to fix in a companion PR.
Key Facts
Verdict: 58/D · draft
Production blockers: 4 of 10 principles
Aligned: 5 of 10 principles
Methodology: Single-pass scan · run-id pinned
What was scanned
Submission: `a2a/agent_executor.py` (the `GovernedFileAgent` A2A executor) plus `server.py` (the stdio proxy to the remote Blueprint MCP). Both are public files in `aidesignblueprint/integrations`, the canonical integrations repo for the doctrine. The submission included the same files in the same shape that any reader would clone from the public repo.
The example is published as a protocol reference. Its README, the README of `a2aproject/a2a-samples` PR #536, and the case-study framing here all state the same thing: this code demonstrates how to map Blueprint principles to A2A protocol primitives, not how to ship a production file-deletion agent. The scan applies the 10-principle doctrine to the code at face value, and the case study names which findings reflect that intentional scope.
Decision, stated plainly
Four of the five non-aligned findings are architectural (P1 delegation envelope, P5 implied magic, P7 audit chain, P8 typed approval). One was small enough to fix in a companion PR (P4 progressive disclosure, scoped to the stdio proxy). Architectural findings make up 80% of the non-aligned set, decisively above the 60% threshold at which the right move is FRAME, not FIX.
Rebuilding the example to score 100/A would require adding a real filesystem call, a structured approval primitive, a durable audit ledger, and a typed action envelope, all of which would turn a protocol reference into production infrastructure and erase the thing the example actually demonstrates. The four architectural blockers stay in the code as deliberate scope. The P4 trivial fix rides in a small companion PR on the integrations repo.
Three case studies, three facets
Each of the three recursive-integrity case studies on this site exercises the doctrine against a different code surface, so a reader can triangulate what the validator does well and where its scope ends.
How to read each finding
Each blocker below is annotated with the AUX pattern, a three-line shape we use to make doctrine findings actionable: what the principle requires, where this code falls short and why, and what the doctrine-compliant version would look like. This is the first case study to publish those annotations explicitly, so the vocabulary lands here.
Verdict
58/D, draft. Four production blockers across P1, P5, P7, P8. One hardening recommendation on P4. Five principles aligned (P2, P3, P6, P9, P10). The validator's code classification labels the submission an `autonomous_agentic_workflow` because the A2A executor implements delegated task lifecycle, pause/resume via `TASK_STATE_INPUT_REQUIRED`, progress events, cancellation, and terminal completion, even though `server.py` itself is only a synchronous stdio bridge.
The aligned-principles set is meaningful. The validator credits the executor for emitting genuinely perceptible task state (P2), calibrating feedback to attention (P3), exposing meaningful operational state rather than internal complexity (P6), representing work as a task system rather than a chat transcript (P9), and supporting steering through cancel and abort paths (P10). The non-aligned set names the gaps that would have to close for the code to read as production-governed, and those gaps are exactly where the example stops being a protocol demonstration.
P5, the load-bearing finding
The validator's prose on P5 is the architectural root cause. The executor reports validation and deletion success without doing either, which means three of the other findings (P1 envelope, P7 audit chain, P8 typed approval) cannot land cleanly until the example either becomes real or labels itself a simulation. The text is the validator's, not the case-study author's.
NEEDS_CHANGES · 70/100
The mental model is materially misleading. The approval text says `This will permanently delete the requested file`, the progress message says `Validating target path...`, and the result says `File deleted successfully.`, but the code contains no parsed target path, no validation primitive, and no deletion call. Users or implementers could believe the example demonstrates a complete governed deletion when it is only emitting status text.
Production blockers, with AUX annotations
The four architectural blockers, ordered by severity. Each is annotated with the validator's verbatim evidence prose plus a three-line AUX pattern: what the principle requires, where this code falls short and why, and what the doctrine-compliant shape would look like.
P8 · Make hand-offs, approvals, and blockers explicit
NEEDS_CHANGES · PRODUCTION_BLOCKER · 80/100
The handoff is explicit at the protocol level because the first call emits `TASK_STATE_INPUT_REQUIRED` with a concrete instruction to reply `confirm`, and the abort path emits `TASK_STATE_CANCELED`. However, the approval gate is unsafe: `if "confirm" not in user_input.lower()` means phrases like `do not confirm` or `confirm nothing` proceed, and the approval message is not bound to a specific file path or operation instance.
Requires
An approval must be exact, typed, and bound to the operation it authorises.
Falls short
The substring check `"confirm" in user_input.lower()` accepts ambiguous phrases, and the prompt does not name the target path so the resumed task can execute an operation different from the one the user thought they approved.
AUX pattern
Replace the substring match with an exact, typed approval primitive such as `DELETE <target_path> <approval_nonce>`, reject anything else, and render the same envelope verbatim in the blocker prompt the user sees.
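The gap between the two gates can be sketched in a few lines. This is a hypothetical illustration, not code from the sample: the function names and the `DELETE <target_path> <approval_nonce>` envelope shape are assumptions drawn from the AUX recommendation above.

```python
import secrets

def substring_gate(user_input: str) -> bool:
    # The unsafe pattern the validator flags:
    # "do not confirm" and "confirm nothing" both pass this check.
    return "confirm" in user_input.lower()

def make_approval_prompt(target_path: str) -> tuple[str, str]:
    # Issue a one-time nonce and a prompt naming the exact operation instance.
    nonce = secrets.token_hex(4)
    return nonce, f"To proceed, reply exactly: DELETE {target_path} {nonce}"

def typed_gate(user_input: str, target_path: str, nonce: str) -> bool:
    # Accept only the exact envelope issued alongside the blocker prompt.
    return user_input.strip() == f"DELETE {target_path} {nonce}"
```

Binding the nonce to the prompt means a stale or replayed confirmation cannot authorise a different operation than the one the user saw.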
P5 · Replace implied magic with clear mental models
NEEDS_CHANGES · PRODUCTION_BLOCKER · 70/100
The mental model is materially misleading. The approval text says `This will permanently delete the requested file`, the progress message says `Validating target path...`, and the result says `File deleted successfully.`, but the code contains no parsed target path, no validation primitive, and no deletion call. Users or implementers could believe the example demonstrates a complete governed deletion when it is only emitting status text.
Requires
If the code says it did something, it must have done that thing.
Falls short
There is no path variable, no filesystem call, no remote tool invocation, no validation function, and no error handling, yet the result event reports `File deleted successfully.` regardless of any actual operation.
AUX pattern
Either label the sample as a non-destructive simulation in the result text itself, or wire the deletion to a real primitive whose success or failure drives the completion event so the artifact reports what actually happened.
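Either branch of that recommendation can be expressed in one small function. A minimal sketch, assuming a local-filesystem deletion as the "real primitive"; the function name and the `[SIMULATION]` label are hypothetical, not from the sample code.

```python
import os

def execute_deletion(target_path: str, simulate: bool = True) -> str:
    """Return result text that reports what actually happened."""
    if simulate:
        # Honest simulation: the artifact says so instead of claiming success.
        return f"[SIMULATION] No file was deleted; would have targeted {target_path}"
    try:
        os.remove(target_path)  # the real primitive whose outcome drives the event
        return f"File deleted: {target_path}"
    except OSError as exc:
        return f"Deletion failed for {target_path}: {exc}"
```

Either way, the completion event is downstream of an actual outcome rather than a hard-coded string.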
P7 · Establish trust through inspectability
NEEDS_CHANGES · PRODUCTION_BLOCKER · 65/100
The workflow does not provide an inspectable production trace for an accountability-sensitive action. `TaskArtifactUpdateEvent` only contains `new_text_artifact(name="result", text="File deleted successfully.")`; there is no audit record of which file was requested, what validation occurred, who confirmed, what exact confirmation was accepted, or whether the final operation actually changed anything. The MCP proxy similarly forwards `_call_tool()` to `client.call_tool(name, arguments or {})` and returns only `result.content`, with no call correlation or audit envelope in this code.
Requires
Every accountability-sensitive action must leave a durable, inspectable record outside the execution loop.
Falls short
The single text artifact `result` carries no `task_id` correlation, no approval text, no validation evidence, no executor identity, and no operation outcome, so there is no way to reconstruct what happened after the task closes.
AUX pattern
Move auditability into a durable task ledger or structured artifact recording task_id, immutable action envelope, approval decision, validation evidence, execution result, timestamps, and actor identity, separately from the user-facing status text.
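The shape of such a record can be sketched with a frozen dataclass appended to a JSONL ledger. The field names mirror the list above; the class, function, and file format are illustrative assumptions, not the doctrine's prescribed schema.

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass(frozen=True)
class AuditRecord:
    # Durable record of one accountability-sensitive action,
    # kept outside the execution loop and the user-facing status text.
    task_id: str
    action: str
    target_path: str
    approval_text: str
    validation_evidence: str
    outcome: str
    actor: str
    timestamp: float = field(default_factory=time.time)

def append_to_ledger(record: AuditRecord, ledger_path: str) -> None:
    # Append-only JSONL file standing in for a durable ledger service.
    with open(ledger_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")
```

Because the record carries the `task_id` and the exact approval text, what happened remains reconstructable after the task closes.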
P1 · Design for delegation rather than direct manipulation
NEEDS_CHANGES · PRODUCTION_BLOCKER · 50/100
The executor is structured around delegated work rather than manual steps: `GovernedFileAgent.execute()` creates or resumes an A2A task, emits `TASK_STATE_WORKING`, pauses for approval, resumes, and completes with an artifact. However, delegated authority is still represented only as free text from `context.get_user_input()` / `context.message`; there is no structured target file, constraint envelope, or explicit scope of authority attached to the task before asking the user to confirm deletion.
Requires
Delegated authority must be a typed, immutable envelope, not free text.
Falls short
The executor extracts only `user_input = context.get_user_input().strip()` and never binds the request to a concrete target_path, action_kind, or constraint set, so the approval prompt is talking about a file that the task does not actually carry.
AUX pattern
Bind the delegated job to a typed envelope (`{action, target_path, constraints, requested_by, task_id}`) before requesting approval, and render that same envelope verbatim in the blocker so the user authorises exactly what the task will execute.
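The envelope shape named above can be sketched as an immutable dataclass. The class and method names are hypothetical; only the field set (`action`, `target_path`, `constraints`, `requested_by`, `task_id`) comes from the AUX recommendation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionEnvelope:
    # Typed, immutable scope of authority bound to the task before approval.
    action: str
    target_path: str
    constraints: tuple[str, ...]
    requested_by: str
    task_id: str

    def render_blocker(self) -> str:
        # Render the same envelope verbatim in the approval prompt,
        # so the user authorises exactly what the task will execute.
        return (
            f"Action: {self.action}\n"
            f"Target: {self.target_path}\n"
            f"Constraints: {', '.join(self.constraints) or 'none'}\n"
            f"Requested by: {self.requested_by} · Task: {self.task_id}"
        )
```

Freezing the dataclass means the envelope the user approved cannot be mutated between the blocker and execution.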
One hardening recommendation rides a companion fix
P4 (progressive disclosure) was the only finding small enough to address without changing the demo's scope, so it is shown here as a hardening recommendation rather than an architectural blocker requiring an AUX annotation. The companion PR on aidesignblueprint/integrations restores `outputSchema` and `structuredContent` end-to-end on the stdio proxy at `server.py`, so MCP Inspector validation and downstream agents can rely on the typed surface the upstream Blueprint MCP already advertises on all 24 tools.
P4 · Apply progressive disclosure to system agency
NEEDS_CHANGES · HARDENING_RECOMMENDED · 35/100
The default status messages are concise, but there is no deeper inspection layer when confidence or intervention matters. The workflow emits only generic messages such as `Validating target path...`, `Executing file deletion...`, and an artifact with `File deleted successfully.`; it does not expose the target path, validation result, approval record, or action details as inspectable task detail.
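The layered shape the finding asks for can be sketched as a status event that keeps the concise headline but attaches an inspectable detail payload. This is a hypothetical illustration of the pattern, not the A2A event schema; the headline strings are the ones quoted in the finding.

```python
def status_event(phase: str, detail: dict) -> dict:
    # Concise headline by default; structured detail available on inspection.
    headlines = {
        "validate": "Validating target path...",
        "execute": "Executing file deletion...",
    }
    return {
        "message": headlines.get(phase, phase),
        "detail": detail,  # target path, validation result, approval record, etc.
    }
```

The default view stays quiet; the moment confidence or intervention matters, the same event carries the specifics.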
Why this matters operationally
Cohort applicants will see this example before they see almost anything else on this site. If the validator is honest about its own reference code, then the validator can be honest about applicant code. If the validator hides the gaps in its own reference, then the validator's verdicts elsewhere read as marketing rather than diagnostic. Publishing the 58/D verdict against our own example is the cheapest way to prove which mode the validator is in.
The four architectural blockers each name a specific Blueprint principle that reference code can demonstrate the protocol mapping for, but cannot satisfy at production level without ceasing to be reference code. P1 wants a typed envelope, P5 wants real execution, P7 wants a durable audit ledger, P8 wants an exact approval primitive. A protocol demonstration cannot carry all four without becoming a full file-management agent, which is not what the example is for.
Naming the gaps explicitly turns the example from a quiet over-claim into a teaching artifact. A practitioner reading the A2A code now reads it next to a public verdict naming the exact mechanisms that would need to land before the pattern reaches production. The example demonstrates a protocol. The case study demonstrates the doctrine. Both are useful, and the validator does the work of telling them apart.
What this case study establishes
Three recursive-integrity layers now have public receipts: validator on its own bridge, validator on the substrate the doctrine runs on top of, and validator on a published protocol reference. Same 10-principle rubric, same mechanism-specific engagement, three different code surfaces, three different verdicts, three different framings. The receipts are the artifact.
Receipts
Replayable from the run_id: /readiness-review/ca187db7…. Full per-principle reasoning, severities, recommendations, and code-classification rationale are persisted server-side. The 10-principle doctrine fingerprint is the same fingerprint that scored every prior layer-1 and layer-2 run on this site.
This scan reflects the integrations repo at run-time on 2026-05-14. The companion P4 fix in PR aidesignblueprint/integrations#1 was open at scan time and lands separately; the architectural blockers (P1, P5, P7, P8) are deliberate scope of the protocol reference and are not expected to close in this surface. Future rescans, if conducted, would publish as separate case studies with their own run_ids.