Reviewed

Assessment complete; awaiting evidence revision.

Agent Architecture Review, Validation snapshot

Evaluated 23 May 2026 against the AI Design Blueprint doctrine

Production-ready

Status: Aligned

98/100

Grade A

9 aligned · 1 hardening

production_ready means trust boundaries hold. The hardening recommendations below are iteration material, not a deficit: that's what production_ready means under the doctrine.

Blueprint Readiness measures doctrine alignment, not runtime correctness. A production-ready verdict means the architecture embodies the 10 principles; it does not run your tests or types. Layer it on top of your test suite, not in place of it.

Per-principle verdicts

The submitted code is an autonomous background validation workflow and the SEP-1686 task adoption materially closes the durable lifecycle gap: `task_id`, `task_state`, history recovery, projected UI status, dedupe signalling, and cancellation are now represented as durable product primitives rather than chat-scrollback state. The main remaining gap is inspectability of how the validation result was produced: the snippets show lifecycle/state persistence, but not a compact provenance or event ledger for model/tool/policy decisions behind `_execute_validate_body()`.

Not yet certified

What to expect from certification

This run is eligible for the certified production_ready badge. Certification is an adversarial second-pass review, independent of the first pass. It's the extra layer of proof that separates a "scored production_ready" run from a certified one.

Three possible outcomes:

  • confirmed_production_ready: the cert reviewer agrees with the first pass. The certified badge mints.
  • downgraded_to_emerging: the cert reviewer surfaces a production_blocker the first pass missed. Tier is capped at emerging.
  • unavailable_provider_error: transient LLM provider error. Retry; doesn't count as a downgrade.

A downgrade is by design, not a defect. The cert reviewer is an adversarial, independent, deliberately stricter second pass. When it downgrades, it's doing its job: catching what the first pass missed. That's the additional layer that makes production_ready a guarantee rather than an estimate.

To certify this run, call architect.certify(run_id, code) via MCP, or certify from the app on a Pro/Teams plan. Each run allows three attempts; each attempt is one LLM call (typically 60-180 seconds at high reasoning effort, with a server-side budget of 20 minutes).
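
The three outcomes above suggest a small retry loop on the caller's side. The following is a minimal sketch, not the product's client: `certify_with_retry` is a hypothetical helper, the `certify` argument stands in for whatever callable wraps architect.certify(run_id, code), and it assumes that call returns the outcome name as a string and that every call counts toward the three-attempt limit.

```python
def certify_with_retry(certify, run_id, code, max_attempts=3):
    """Call `certify` until it returns a non-transient outcome.

    `certify` is an injected callable (assumption) that returns one of the
    outcome names listed in the report. Returns (outcome, attempts_used).
    """
    outcome = None
    for attempt in range(1, max_attempts + 1):
        outcome = certify(run_id, code)
        # Only the transient provider error is worth retrying; a confirmation
        # or a downgrade is a final answer from the reviewer.
        if outcome != "unavailable_provider_error":
            return outcome, attempt
    return outcome, max_attempts
```

A downgrade is returned as-is rather than retried, since the report describes it as the reviewer doing its job rather than as a failure.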

Per-principle findings

10 principles evaluated. Verdict, severity, evidence and recommendation for each.

P7

needs changes · hardening recommended · 35/100

Establish trust through inspectability

The submission improves lifecycle inspectability with durable `UserValidationRun` rows, `task_id`, `task_state`, `structuredContent=task_result_dict`, and `me.validation_history(run_id)`. However, the snippets do not show a provenance/event ledger for how `_execute_validate_body()` produced the assessment: there is no visible persisted trace of model/policy version, source bundle identity beyond `code_fingerprint`, tool calls, decision checkpoints, or result-generation events. The current primitive makes the task lifecycle inspectable, but not yet the production path of the validation judgement itself.

Recommendation

Persist a compact run provenance ledger keyed by `run_id` outside the execution loop, recording the input/code fingerprint, policy/model version, relevant tool calls, decision checkpoints, and final result hash; link that ledger from `validation_history` while keeping the dashboard projection separate from raw internals.
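
One possible shape for such a ledger is sketched below. It is illustrative only: `ProvenanceLedger`, `ProvenanceEvent`, and their field names are hypothetical and do not come from the submitted code, and persistence is reduced to producing a plain dict that a store could write alongside the run row.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ProvenanceEvent:
    kind: str      # e.g. "tool_call", "checkpoint", "result" (assumed vocabulary)
    detail: dict
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


@dataclass
class ProvenanceLedger:
    run_id: str
    code_fingerprint: str
    policy_version: str
    model_version: str
    events: list = field(default_factory=list)
    result_hash: Optional[str] = None

    def record(self, kind, **detail):
        """Append one event; callers record tool calls and decision checkpoints."""
        self.events.append(ProvenanceEvent(kind, detail))

    def seal(self, result):
        """Hash the final result over a canonical JSON form and record it."""
        canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
        self.result_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        self.record("result", result_hash=self.result_hash)
        return self.result_hash

    def to_row(self):
        """Plain-dict form, suitable for a JSON column keyed by run_id."""
        return asdict(self)
```

Hashing a canonical serialisation means two runs that produce the same assessment yield the same `result_hash`, which makes the ledger useful for comparison without exposing raw internals on the dashboard.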

P1

aligned

Design for delegation rather than direct manipulation

Delegation is represented explicitly: the handler accepts `params.task: {ttl}`, mints a `run_id`/`task_id` at start, returns `CreateTaskResult` immediately, and performs validation in background through `experimental.run_task`. The REST `cancel_task_on_validation_run` endpoint and `can_cancel` projection give the user a termination control instead of forcing them to wait on a blocking call.
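
The return-immediately pattern described here can be illustrated without the MCP SDK. The sketch below is a stand-in built on plain asyncio, not the reviewed handler: `start_validation`, `TASKS`, and `_validate_body` are hypothetical names, and the in-memory dict replaces the durable store.

```python
import asyncio
import uuid

TASKS = {}  # in-memory stand-in for a durable task table (assumption)


async def _validate_body(code):
    # Placeholder for the real validation work.
    await asyncio.sleep(0)
    return {"verdict": "ok", "chars": len(code)}


async def _run(task_id, code):
    try:
        TASKS[task_id]["result"] = await _validate_body(code)
        TASKS[task_id]["status"] = "completed"
    except Exception as exc:
        # Record the failure on the task instead of losing it.
        TASKS[task_id]["status"] = "failed"
        TASKS[task_id]["status_message"] = str(exc)


def start_validation(code, ttl_ms):
    """Mint a task id, schedule the work, and return a handle at once.

    Must be called from inside a running event loop.
    """
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"status": "working", "ttl_ms": ttl_ms}
    # Keep a reference so the background task is not garbage-collected.
    TASKS[task_id]["_handle"] = asyncio.create_task(_run(task_id, code))
    return {"task_id": task_id, "status": "working"}
```

The caller receives `task_id` before any validation has run, so status polling and cancellation can be built on the id rather than on a blocking call.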

P2

aligned

Ensure that background work remains perceptible

The previous ephemeral lifecycle is replaced with durable, perceptible primitives: `UserValidationRun.task_id` is a top-level recovery handle, `task_state` stores lifecycle JSON, `project_task_state()` projects it into dashboard-facing status, and `me.validation_history(run_id)` is described as the recovery path. The UI also renders `task_status` across history, projects, dashboard recent activity, and detail surfaces.

P3

aligned

Align feedback with the user’s level of attention

Feedback is proportionate: the primary UI consumes the compact `TaskStateProjection` with `task_status`, `status_message`, `terminal_reason`, and `can_cancel`, while routine background work is shown through badges/pills rather than full protocol internals. Terminal and exceptional states such as `failed`, `cancelled`, `task_state_parse_error`, and no-result cards escalate only when user attention is needed.

P4

aligned

Apply progressive disclosure to system agency

The code applies progressive disclosure by separating the raw task protocol state from the dashboard surface. `project_task_state()` strips `schema_version`, `idempotency_key`, `progress_token`, and `ttl_ms`; normal views show user-level lifecycle state, while deeper recovery/inspection happens through `me.validation_history(run_id)` and task result retrieval.
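
A projection of this kind might look like the following sketch. It is not the reviewed `project_task_state()`: the function name is deliberately suffixed with `_sketch`, the stripped field names are taken from the finding above, and returning `task_state_parse_error` as a status for unreadable input is an assumption about how that state is produced.

```python
import json

# Protocol-level fields the finding says are stripped from the dashboard view.
INTERNAL_FIELDS = {"schema_version", "idempotency_key", "progress_token", "ttl_ms"}


def project_task_state_sketch(raw_task_state):
    """Turn stored lifecycle JSON into a dashboard-safe dict."""
    try:
        state = json.loads(raw_task_state)
    except (TypeError, ValueError):
        # Unreadable state is surfaced as an explicit status, not hidden.
        return {"task_status": "task_state_parse_error"}
    if not isinstance(state, dict):
        return {"task_status": "task_state_parse_error"}
    return {k: v for k, v in state.items() if k not in INTERNAL_FIELDS}
```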

P5

aligned

Replace implied magic with clear mental models

The workflow replaces implied magic with named states and documented limits: `_PROJECTION_STATUS_MAP` maps protocol states like `working` and `input_required` into `active` and `awaiting_action`; cancelled/failed-without-assessment runs get explicit no-result handling; and the asymmetry that only `architect.validate` is task-augmented while `validate_consensus` and `certify` remain sync-only is disclosed on rendered surfaces.
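
As an illustration of the mapping idea, the sketch below translates protocol states into user-facing ones. Only the `working` and `input_required` entries come from the finding; the remaining entries, the `unknown` fallback, and the `project_status` helper with its `no_result` flag are assumptions made for the example.

```python
PROJECTION_STATUS_MAP = {
    "working": "active",                  # stated in the finding
    "input_required": "awaiting_action",  # stated in the finding
    "completed": "completed",             # assumed pass-through
    "failed": "failed",                   # assumed pass-through
    "cancelled": "cancelled",             # assumed pass-through
}


def project_status(protocol_status, has_assessment):
    """Map a protocol state to a dashboard state and flag no-result runs."""
    task_status = PROJECTION_STATUS_MAP.get(protocol_status, "unknown")
    # Cancelled or failed runs without an assessment get explicit handling.
    no_result = protocol_status in {"cancelled", "failed"} and not has_assessment
    return {"task_status": task_status, "no_result": no_result}
```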

P6

aligned

Expose meaningful operational state, not internal complexity

The user-facing state is meaningful rather than internal. `TaskStateProjection` exposes `task_status`, `status_message`, timestamps, `terminal_reason`, and `can_cancel`, while intentionally hiding protocol fields such as `idempotency_key`, `progress_token`, and `ttl_ms`. Frontend reads consume the projection only, avoiding client-side parsing of raw `task_state`.

P8

aligned

Make hand-offs, approvals, and blockers explicit

Hand-offs and blockers are explicit. Idempotency dedupe returns an existing `CreateTaskResult` and annotates `_meta.dedupe_hit` plus `_meta.dedupe_link` rather than silently swallowing duplicate work. The task worker raises on structured error envelopes so the SDK can mark the task failed instead of completed, and cancellation returns clear 400/404/409 cases in `cancel_task_on_validation_run`.
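
The dedupe signalling can be sketched as follows. This is a simplified in-memory illustration rather than the reviewed handler: `DedupeIndex` is a hypothetical class, and the `/runs/<task_id>` link format is invented for the example. Only the `_meta.dedupe_hit` and `_meta.dedupe_link` keys come from the finding.

```python
import uuid


class DedupeIndex:
    """Maps idempotency keys to task ids so duplicates are reported, not hidden."""

    def __init__(self):
        self._by_key = {}

    def create_task(self, idempotency_key):
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            # Return the existing task and say so explicitly.
            return {
                "task_id": existing,
                "_meta": {"dedupe_hit": True, "dedupe_link": f"/runs/{existing}"},
            }
        task_id = str(uuid.uuid4())
        self._by_key[idempotency_key] = task_id
        return {"task_id": task_id, "_meta": {"dedupe_hit": False}}
```

Returning the original `task_id` together with an explicit flag lets the client distinguish a fresh run from a reused one without guessing.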

P9

aligned

Represent delegated work as a system, not merely as a conversation

Delegated validation work is represented as a system: `PgValidationTaskStore` backs tasks with database rows, `task_id` is unique-indexed, `task_state` persists lifecycle, tasks are queryable/listable/cancellable through task methods, and dashboard surfaces render lifecycle status independently of conversation text. This is a durable task model rather than an unstructured transcript.

P10

aligned

Optimise for steering, not only initiating

The code supports mid-run steering through cancellation. `cancel_task_on_validation_run()` exposes a user-facing cancel endpoint; `PgValidationTaskStore.update_task(..., status='cancelled')` writes the terminal state and sets `row.cancel_requested = True`; terminal-to-nonterminal transitions are rejected using `is_terminal()`; and the UI uses `can_cancel` to show the control only while it is actionable. The documentation also states plainly that in-flight provider calls may not be abortable, which avoids a false mental model of magical cancellation.
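
The transition rule can be sketched with an in-memory store. This is an illustration under assumptions, not `PgValidationTaskStore`: `InMemoryTaskStore`, `TaskRow`, and `cancel_run` are hypothetical, the set of terminal states is assumed, and the mapping of 404 to an unknown task and 409 to an already-terminal task is a guess at how the endpoint's error cases divide.

```python
from dataclasses import dataclass

TERMINAL_STATES = {"completed", "failed", "cancelled"}  # assumed set


def is_terminal(status):
    return status in TERMINAL_STATES


@dataclass
class TaskRow:
    status: str = "working"
    cancel_requested: bool = False


class InMemoryTaskStore:
    def __init__(self):
        self.rows = {}

    def update_task(self, task_id, status):
        row = self.rows[task_id]
        # A finished task must never be moved back to a live state.
        if is_terminal(row.status) and not is_terminal(status):
            raise ValueError(f"cannot move {row.status!r} back to {status!r}")
        row.status = status
        if status == "cancelled":
            row.cancel_requested = True
        return row


def cancel_run(store, task_id):
    """Return an HTTP-style status code for a cancel request."""
    row = store.rows.get(task_id)
    if row is None:
        return 404
    if is_terminal(row.status):
        return 409
    store.update_task(task_id, status="cancelled")
    return 200
```

Under this sketch, a `can_cancel` projection would simply be `not is_terminal(row.status)`. Setting `cancel_requested` records the user's intent even when the in-flight provider call cannot be aborted.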

Embed in your README

Two embeddable variants: a small flat shield and a richer score card.

Score card (recommended)

[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/2876f354-e141-41f3-9582-b413567b0f77/card.svg)](https://aidesignblueprint.com/en/readiness-review/2876f354-e141-41f3-9582-b413567b0f77)

Flat badge

[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/2876f354-e141-41f3-9582-b413567b0f77.svg)](https://aidesignblueprint.com/en/readiness-review/2876f354-e141-41f3-9582-b413567b0f77)
Baseline and iteration details
Rubric: 2026-05-04

Run ID: 2876f354-e141-41f3-9582-b413567b0f77 · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.