Case study · Recursive integrity · Reference honesty

A2A reference agent: what the validator finds in our own example

The validator that grades cohort applicants was fired against the Blueprint Governed Agent, our own A2A reference example published to a2aproject/a2a-samples. The verdict is 58/D, draft, with four production blockers. The case study names exactly which blockers are deliberate scope (the example is a protocol demonstration, not production infrastructure) and which were small enough to fix in a companion PR.

Key Facts

Verdict
58/D · draft
Production blockers
4 of 10 principles
Aligned
5 of 10 principles
Methodology
Single-pass scan · run-id pinned

Live reference scan

architect.validate run on the A2A reference example

The receipt is real, not a demo. The badge below links to the persisted readiness review with the full per-principle verdict, severity scores, and the validator's evidence prose. The same doctrine that powers every cohort review scored this code at run-time, so the artifact is replayable.

Blueprint Readiness Score card, A2A reference scan

Single-pass scan · run-id pinned

AI Design Blueprint, A2A reference scan badge · run ca187db7-82d3-41eb-8c2d-57890d954fa7 · code 1d1cd0a · doctrine e3b76ac5

What was scanned

Submission: `a2a/agent_executor.py` (the `GovernedFileAgent` A2A executor) plus `server.py` (the stdio proxy to the remote Blueprint MCP). Both are public files in `aidesignblueprint/integrations`, the canonical integrations repo for the doctrine. The submission included the same files in the same shape that any reader would clone from the public repo.

The example is published as a protocol reference. Its README, the README of `a2aproject/a2a-samples` PR #536, and the case-study framing here all state the same thing: this code demonstrates how to map Blueprint principles to A2A protocol primitives, not how to ship a production file-deletion agent. The scan applies the 10-principle doctrine to the code at face value, and the case study names which findings reflect that intentional scope.

Decision, stated plainly

Four of the five non-aligned findings are architectural (P1 delegation envelope, P5 implied magic, P7 audit chain, P8 typed approval). One was small enough to fix in a companion PR (P4 progressive disclosure, scoped to the stdio proxy). The architectural-to-trivial ratio is 4:1, decisively above the 60% threshold where the right move is FRAME, not FIX.

Rebuilding the example to score 100/A would require adding a real filesystem call, a structured approval primitive, a durable audit ledger, and a typed action envelope, all of which would turn a protocol reference into production infrastructure and erase the thing the example actually demonstrates. The four architectural blockers stay in the code as deliberate scope. The P4 trivial fix rides in a small companion PR on the integrations repo.

Three case studies, three facets

Each of the three recursive-integrity case studies on this site exercises the doctrine against a different code surface, so a reader can triangulate what the validator does well and where its scope ends.

How to read each finding

Each blocker below is annotated with the AUX pattern, a three-line shape we use to make doctrine findings actionable: what the principle requires, where this code falls short and why, and what the doctrine-compliant version would look like. This is the first case study to publish those annotations explicitly, so the vocabulary lands here.

Verdict

58/D, draft. Four production blockers across P1, P5, P7, P8. One hardening recommendation on P4. Five principles aligned (P2, P3, P6, P9, P10). The validator's code classification labels the submission an `autonomous_agentic_workflow` because the A2A executor implements delegated task lifecycle, pause/resume via `TASK_STATE_INPUT_REQUIRED`, progress events, cancellation, and terminal completion, even though `server.py` itself is only a synchronous stdio bridge.

The aligned-principles set is meaningful. The validator credits the executor for emitting genuinely perceptible task state (P2), calibrating feedback to attention (P3), exposing meaningful operational state rather than internal complexity (P6), representing work as a task system rather than a chat transcript (P9), and supporting steering through cancel and abort paths (P10). The non-aligned set names the gaps that would have to close for the code to read as production-governed, and those gaps are exactly where the example stops being a protocol demonstration.

P5, the load-bearing finding

The validator's prose on P5 is the architectural root cause. The executor reports validation and deletion success without doing either, which means three of the other findings (P1 envelope, P7 audit chain, P8 typed approval) cannot land cleanly until the example either becomes real or labels itself a simulation. The text is the validator's, not the case-study author's.

NEEDS_CHANGES · 70/100

The mental model is materially misleading. The approval text says `This will permanently delete the requested file`, the progress message says `Validating target path...`, and the result says `File deleted successfully.`, but the code contains no parsed target path, no validation primitive, and no deletion call. Users or implementers could believe the example demonstrates a complete governed deletion when it is only emitting status text.

architect.validate · P5 verdict on aidesignblueprint/integrations · production_blocker · sev 70 · ca187db7

Production blockers, with AUX annotations

The four architectural blockers. Each is annotated with the validator's verbatim evidence prose plus a three-line AUX pattern: what the principle requires, where this code falls short and why, and what the doctrine-compliant shape would look like. Ordered by severity.

P8 · Hand-offs, approvals, and blockers
NEEDS_CHANGES · PRODUCTION_BLOCKER · 80/100

The handoff is explicit at the protocol level because the first call emits `TASK_STATE_INPUT_REQUIRED` with a concrete instruction to reply `confirm`, and the abort path emits `TASK_STATE_CANCELED`. However, the approval gate is unsafe: `if "confirm" not in user_input.lower()` means phrases like `do not confirm` or `confirm nothing` proceed, and the approval message is not bound to a specific file path or operation instance.

Requires

An approval must be exact, typed, and bound to the operation it authorises.

Falls short

The substring check `"confirm" in user_input.lower()` accepts ambiguous phrases, and the prompt does not name the target path so the resumed task can execute an operation different from the one the user thought they approved.

AUX pattern

Replace the substring match with an exact, typed approval primitive such as `DELETE <target_path> <approval_nonce>`, reject anything else, and render the same envelope verbatim in the blocker prompt the user sees.
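A minimal sketch of that exact-match approval primitive. The function names, the nonce scheme, and the `DELETE <path> <nonce>` envelope format are illustrative, not code from the reference repo:

```python
import secrets


def issue_approval_challenge(target_path: str) -> str:
    """Mint a nonce and render the exact envelope the user must echo back.

    Binding the challenge to the path and a fresh nonce ties the approval
    to one specific operation instance.
    """
    nonce = secrets.token_hex(4)
    return f"DELETE {target_path} {nonce}"


def approval_matches(challenge: str, user_input: str) -> bool:
    """Accept only an exact, whole-string match of the challenge.

    Unlike the substring check `"confirm" in user_input.lower()`, this
    rejects ambiguous phrases such as "do not confirm" outright.
    """
    return user_input.strip() == challenge
```

Rendering the same challenge verbatim in the blocker prompt means the string the user approves is the string the resumed task checks.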

P5 · Replace implied magic with clear mental models
NEEDS_CHANGES · PRODUCTION_BLOCKER · 70/100

The mental model is materially misleading. The approval text says `This will permanently delete the requested file`, the progress message says `Validating target path...`, and the result says `File deleted successfully.`, but the code contains no parsed target path, no validation primitive, and no deletion call. Users or implementers could believe the example demonstrates a complete governed deletion when it is only emitting status text.

Requires

If the code says it did something, it must have done that thing.

Falls short

There is no path variable, no filesystem call, no remote tool invocation, no validation function, and no error handling, yet the result event reports `File deleted successfully.` regardless of any actual operation.

AUX pattern

Either label the sample as a non-destructive simulation in the result text itself, or wire the deletion to a real primitive whose success or failure drives the completion event so the artifact reports what actually happened.
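A sketch of the second option, wiring the completion text to a real primitive. The function name and message strings are illustrative; the point is that the artifact's text is derived from the operation's actual outcome:

```python
from pathlib import Path


def execute_deletion(target_path: str) -> tuple[bool, str]:
    """Perform the real filesystem call and report what actually happened.

    The completion message is computed from the outcome, so the artifact
    can never claim `File deleted successfully.` for work not done.
    """
    path = Path(target_path)
    try:
        path.unlink()  # the real destructive primitive
        return True, f"Deleted {path}."
    except FileNotFoundError:
        return False, f"Nothing to delete: {path} does not exist."
    except PermissionError:
        return False, f"Permission denied deleting {path}."
```

The boolean drives the terminal task state (completed vs failed) and the string becomes the result artifact, instead of a hard-coded success message.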

P7 · Trust through inspectability
NEEDS_CHANGES · PRODUCTION_BLOCKER · 65/100

The workflow does not provide an inspectable production trace for an accountability-sensitive action. `TaskArtifactUpdateEvent` only contains `new_text_artifact(name="result", text="File deleted successfully.")`; there is no audit record of which file was requested, what validation occurred, who confirmed, what exact confirmation was accepted, or whether the final operation actually changed anything. The MCP proxy similarly forwards `_call_tool()` to `client.call_tool(name, arguments or {})` and returns only `result.content`, with no call correlation or audit envelope in this code.

Requires

Every accountability-sensitive action must leave a durable, inspectable record outside the execution loop.

Falls short

The single text artifact `result` carries no `task_id` correlation, no approval text, no validation evidence, no executor identity, and no operation outcome, so there is no way to reconstruct what happened after the task closes.

AUX pattern

Move auditability into a durable task ledger or structured artifact recording task_id, immutable action envelope, approval decision, validation evidence, execution result, timestamps, and actor identity, separately from the user-facing status text.
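One possible shape for that ledger: an append-only JSONL file of structured records. The record fields mirror the list above; the class and file layout are illustrative, not the doctrine's prescribed schema:

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass(frozen=True)
class AuditRecord:
    task_id: str
    action: str
    target_path: str
    approval_text: str       # the exact string the user echoed back
    validation_evidence: str
    execution_result: str
    actor: str
    timestamp: float


def append_to_ledger(ledger: Path, record: AuditRecord) -> None:
    """Append one JSON line per accountability-sensitive action.

    The ledger lives outside the execution loop, so what happened can be
    reconstructed after the task closes, independently of the status text.
    """
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")
```

A durable store (database, object storage) would replace the local file in production; the invariant is that the record is written whether or not the user ever reads the result artifact.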

P1 · Design for delegation
NEEDS_CHANGES · PRODUCTION_BLOCKER · 50/100

The executor is structured around delegated work rather than manual steps: `GovernedFileAgent.execute()` creates or resumes an A2A task, emits `TASK_STATE_WORKING`, pauses for approval, resumes, and completes with an artifact. However, delegated authority is still represented only as free text from `context.get_user_input()` / `context.message`; there is no structured target file, constraint envelope, or explicit scope of authority attached to the task before asking the user to confirm deletion.

Requires

Delegated authority must be a typed, immutable envelope, not free text.

Falls short

The executor extracts only `user_input = context.get_user_input().strip()` and never binds the request to a concrete target_path, action_kind, or constraint set, so the approval prompt is talking about a file that the task does not actually carry.

AUX pattern

Bind the delegated job to a typed envelope (`{action, target_path, constraints, requested_by, task_id}`) before requesting approval, and render that same envelope verbatim in the blocker so the user authorises exactly what the task will execute.
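A sketch of that envelope as an immutable dataclass. The field names follow the pattern above; the `render()` helper is an illustrative assumption showing how the same object both gates execution and produces the blocker prompt:

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # immutable: the approved envelope cannot drift
class ActionEnvelope:
    action: str
    target_path: str
    requested_by: str
    task_id: str
    constraints: tuple[str, ...] = ()

    def render(self) -> str:
        """Render the envelope verbatim for the blocker prompt, so the
        user authorises exactly what the resumed task will execute."""
        lines = [f"{self.action} {self.target_path} (task {self.task_id})"]
        lines += [f"  constraint: {c}" for c in self.constraints]
        return "\n".join(lines)
```

Because the dataclass is frozen, the object the user saw at approval time is, by construction, the object the executor acts on after resume.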

One hardening recommendation rides a companion fix

P4 (progressive disclosure) was the only finding small enough to address without changing the demo's scope, so it is shown here as a hardening recommendation rather than an architectural blocker requiring an AUX annotation. The companion PR on aidesignblueprint/integrations restores `outputSchema` and `structuredContent` end-to-end on the stdio proxy at `server.py`, so MCP Inspector validation and downstream agents can rely on the typed surface the upstream Blueprint MCP already advertises on all 24 tools.

P4 · Progressive disclosure of system agency
NEEDS_CHANGES · HARDENING_RECOMMENDED · 35/100

The default status messages are concise, but there is no deeper inspection layer when confidence or intervention matters. The workflow emits only generic messages such as `Validating target path...`, `Executing file deletion...`, and an artifact with `File deleted successfully.`; it does not expose the target path, validation result, approval record, or action details as inspectable task detail.
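The fix shape is to carry the typed detail alongside the concise status text, so a deeper inspection layer exists when it matters. This sketch mirrors MCP's text-content/structured-content split, but the exact dict shape here is illustrative, not the SDK's result type:

```python
def build_tool_result(target_path: str, deleted: bool) -> dict:
    """Return human-readable status and typed detail together, so an
    inspector or downstream agent never has to parse prose.

    Field names echo MCP's content / structuredContent convention;
    the schema itself is an illustrative assumption.
    """
    status = "File deleted successfully." if deleted else "Deletion failed."
    return {
        "content": [{"type": "text", "text": status}],
        "structuredContent": {
            "target_path": target_path,
            "deleted": deleted,
        },
    }
```

The concise text stays the default view; the structured block is the layer a user or agent drills into when confidence or intervention matters.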

Why this matters operationally

Cohort applicants will see this example before they see almost anything else on this site. If the validator is honest about its own reference code, then the validator can be honest about applicant code. If the validator hides the gaps in its own reference, then the validator's verdicts elsewhere read as marketing rather than diagnostic. Publishing the 58/D verdict against our own example is the cheapest way to prove which mode the validator is in.

The four architectural blockers each name a specific Blueprint principle that reference code can demonstrate the protocol mapping for, but cannot satisfy at production level without ceasing to be reference code. P1 wants a typed envelope, P5 wants real execution, P7 wants a durable audit ledger, P8 wants an exact approval primitive. A protocol demonstration cannot carry all four without becoming a full file-management agent, which is not what the example is for.

Naming the gaps explicitly turns the example from a quiet over-claim into a teaching artifact. A practitioner reading the A2A code now reads it next to a public verdict naming the exact mechanisms that would need to land before the pattern reaches production. The example demonstrates a protocol. The case study demonstrates the doctrine. Both are useful, and the validator does the work of telling them apart.

What this case study establishes

Three recursive-integrity layers now have public receipts: validator on its own bridge, validator on the substrate the doctrine runs on top of, and validator on a published protocol reference. Same 10-principle rubric, same mechanism-specific engagement, three different code surfaces, three different verdicts, three different framings. The receipts are the artifact.

Receipts

Replayable from the run_id: /readiness-review/ca187db7…. Full per-principle reasoning, severities, recommendations, and code-classification rationale are persisted server-side. The 10-principle doctrine fingerprint is the same fingerprint that scored every prior layer-1 and layer-2 run on this site.

Companion fix on the integrations repo: aidesignblueprint/integrations#1.

Recurrence policy

This scan reflects the integrations repo at run-time on 2026-05-14. The companion P4 fix in PR aidesignblueprint/integrations#1 was open at scan time and lands separately; the architectural blockers (P1, P5, P7, P8) are deliberate scope of the protocol reference and are not expected to close in this surface. Future rescans, if conducted, would publish as separate case studies with their own run_ids.

Related, Pro / Teams

Run this as a Blueprint Readiness Score

The Architect Agent is the same review pattern shown in this case study, applied to your code. Call architect.validate to get a Blueprint Readiness Score (0–100, A–F) per repository, and a regression diff between runs so the next review focuses on what changed.

Sample score card

acme/customer-agent · B · 82/100 · Production-ready · ▲ 7

Run your own validation

Paste your agent code or describe your workflow. The validator returns principle-by-principle findings, a readiness score, and a shareable review URL in seconds. Reach 80+/A and cert mints a public badge.

Other case studies