Reviewed

Assessment complete; awaiting evidence revision.

Agent Architecture Review, Validation snapshot

Evaluated 7 May 2026 against the AI Design Blueprint doctrine

Emerging

Status: High Risk

68/100

Grade C

6 aligned · 3 production blockers · 1 hardening

Blueprint Readiness measures doctrine alignment, not runtime correctness. A production-ready verdict means the architecture embodies the 10 principles; the review does not run your tests or check your types. Layer it on top of your test suite, not in place of it.

Per-principle verdicts

This is clearly an autonomous agentic workflow and it includes several strong governance primitives: a typed `GovernedPolicy`, a single `Runner.run` call site, enforced `RunState` transitions, durable steering commands, unconditional approval gates, action digests, and a hash-chained audit ledger. The main production blockers are lifecycle durability and sensitive-data handling: execution is still an in-process `asyncio.Task` that can strand runs after process death, and the audit trail records raw policy/action/result payloads that may contain form-fill secrets or PII.

Per-principle findings

10 principles evaluated. Verdict, severity, evidence and recommendation for each.

needs changes · production blocker · 75/100

Establish trust through inspectability

Inspectability is strong structurally: `append_event` creates a hash-chained ledger, `run_genesis` commits the policy digest, tool calls and approval events are recorded, and `verify_audit` recomputes the chain. The blocker is that inspectability is not bounded for sensitive form-fill data: `start_governed_run` records full `policy_json`; `GovernedActionDispatcher.dispatch` logs raw `action_payload`, `target_text`, tool `result`, and error reprs; `transition` persists full `final_output` / `failure_reason`; and `timeline(...)` returns those fields directly. For a browser/form-fill agent, those payloads can contain credentials, PII, or regulated form contents.

Recommendation

Move sensitive payloads behind a redaction-aware audit boundary: keep the hash-chained ledger focused on typed summaries, rule IDs, action digests, and safe metadata, and store raw payloads only in an encrypted/redacted evidence store with explicit sensitivity classification and access control.
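A minimal sketch of what such a boundary might look like, assuming a flat payload dict and a hand-maintained sensitivity list. `SENSITIVE_KEYS`, `payload_digest`, `safe_summary`, and `ledger_entry` are hypothetical names for illustration, not functions from the reviewed code. The ledger entry commits to the raw payload through a digest while storing only a redacted summary.

```python
import hashlib
import json

# Hypothetical sensitivity list; the reviewed code does not define one.
SENSITIVE_KEYS = {"password", "ssn", "card_number", "target_text"}


def payload_digest(payload: dict) -> str:
    """Stable SHA-256 over a canonical JSON encoding of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def safe_summary(payload: dict) -> dict:
    """Keep non-sensitive top-level fields; mask sensitive values."""
    return {
        key: "[REDACTED]" if key in SENSITIVE_KEYS else value
        for key, value in payload.items()
    }


def ledger_entry(prev_hash: str, event_type: str, payload: dict) -> dict:
    """Build a chained entry that commits to the payload without storing it."""
    body = {
        "event_type": event_type,
        "summary": safe_summary(payload),
        "payload_digest": payload_digest(payload),
        "prev_hash": prev_hash,
    }
    # The entry hash covers every field above, including prev_hash.
    body["entry_hash"] = payload_digest(body)
    return body
```

This sketch only masks top-level keys; nested payloads would need a recursive walk, and the raw payload itself would still have to be written to a separate access-controlled store.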

needs changes · production blocker · 70/100

Ensure that background work remains perceptible

The code provides perceptibility during a healthy process via `status`, `timeline`, heartbeat events in `GovernedRunHooks._checkpoint`, and an immediate `runs` row from `create_run`. However, the actual execution lifecycle is not durable: `start_governed_run` uses `asyncio.create_task(...)` and stores the live task only in `GovernedRunHandle.outcome`; `timeout_seconds` is enforced only by in-memory `asyncio.wait_for` inside `_drive_agent_loop`. If the process/container dies, a run can remain `IN_PROGRESS` or `AWAITING_APPROVAL` forever with no worker lease, stale-heartbeat detector, recovery worker, or persisted terminal transition.

Recommendation

Move execution ownership to a durable worker/job primitive: persist a queued/running lease with worker identity and heartbeat timestamp, and run a separate watchdog/reaper that transitions expired leases to `TIMED_OUT` or a recoverable state. Keep `status` backed by that durable lifecycle rather than by an in-process task handle.
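The lease-and-reaper idea can be sketched as follows, assuming a SQLite-backed `runs` table. The schema, the column names `worker_id` and `heartbeat_at`, and the functions `heartbeat` and `reap_expired` are assumptions for illustration; the report does not show the actual persistence layer.

```python
import sqlite3

# Hypothetical schema: the reviewed `runs` table is not shown in the report.
SCHEMA = """
CREATE TABLE runs (
    run_id TEXT PRIMARY KEY,
    state TEXT NOT NULL,
    worker_id TEXT,
    heartbeat_at REAL
)
"""

ACTIVE_STATES = ("IN_PROGRESS", "AWAITING_APPROVAL")


def heartbeat(conn, run_id, worker_id, now):
    """Refresh the lease; only the owning worker may do so."""
    cur = conn.execute(
        "UPDATE runs SET heartbeat_at = ? WHERE run_id = ? AND worker_id = ?",
        (now, run_id, worker_id),
    )
    conn.commit()
    return cur.rowcount == 1


def reap_expired(conn, now, lease_seconds):
    """Move active runs whose lease has lapsed to TIMED_OUT; return their ids."""
    cutoff = now - lease_seconds
    rows = conn.execute(
        "SELECT run_id FROM runs WHERE state IN (?, ?) AND heartbeat_at < ?",
        (*ACTIVE_STATES, cutoff),
    ).fetchall()
    expired = [row[0] for row in rows]
    conn.executemany(
        "UPDATE runs SET state = 'TIMED_OUT' WHERE run_id = ?",
        [(run_id,) for run_id in expired],
    )
    conn.commit()
    return expired
```

The reaper would run in a process separate from the worker, so a dead worker cannot take the watchdog down with it. A production version would also route the update through the same legal-transition check as every other state change and record an audit event.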

needs changes · production blocker · 50/100

Align feedback with the user’s level of attention

The system mostly separates routine and attention-required feedback: heartbeats are written to the ledger, while `_stream_attention_required` surfaces `AWAITING_APPROVAL`, `PAUSED`, `FAILED`, and `TIMED_OUT`. But the CLI stream deduplicates by `RunState` using `seen.add(info.state)`, so a second or later `AWAITING_APPROVAL` transition in the same run will not be announced even though the agent is blocked for a new user decision.

Recommendation

Track and surface attention-required events by event sequence or transition identity, not only by state enum. Each new approval, pause, failure, or timeout event should produce a fresh operator-visible signal.
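One way to do this is to keep a sequence cursor instead of a set of seen states. The sketch below assumes each timeline event carries a monotonically increasing sequence number; `TimelineEvent` and `new_attention_events` are hypothetical names, not part of the reviewed code.

```python
from dataclasses import dataclass

ATTENTION_STATES = frozenset(
    {"AWAITING_APPROVAL", "PAUSED", "FAILED", "TIMED_OUT"}
)


@dataclass(frozen=True)
class TimelineEvent:
    """Hypothetical event shape: a sequence number plus the state entered."""
    seq: int
    state: str


def new_attention_events(events, last_seen_seq):
    """Return attention-required events newer than the cursor, and the new cursor."""
    fresh = [
        e for e in events
        if e.seq > last_seen_seq and e.state in ATTENTION_STATES
    ]
    cursor = max([last_seen_seq] + [e.seq for e in events])
    return fresh, cursor
```

Because the cursor advances by sequence number, a second `AWAITING_APPROVAL` later in the same run is reported again, whereas a per-state `seen` set would suppress it.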

P10

needs changes · hardening recommended · 35/100

Optimise for steering, not only initiating

The code has a real steering primitive: `abort_run`, `pause_run`, and `resume_run` enqueue durable commands; `claim_next_steering_command` gives `abort` priority; and `GovernedRunHooks._drain_steering` applies those commands at checkpoints. The remaining gap is that steering is interrupt/resume-only: the canonical `GovernedPolicy` is committed at genesis and there is no audited primitive for adding a correction, revising constraints, or reprioritising work while preserving continuity.

Recommendation

Add one narrow, audited policy-revision or operator-instruction command if live redirection is in scope; otherwise explicitly declare runs immutable after genesis and present abort/restart as the supported correction model. Avoid adding a routing wrapper around the agent loop.

aligned

Design for delegation rather than direct manipulation

Delegation is modeled through a bounded `GovernedPolicy` containing `task`, `instructions`, `max_turns`, `timeout_seconds`, `permitted_action_scope`, and `approval_gates`. Callers cannot inject an arbitrary `Agent` or tools: `_build_governed_agent` constructs the agent internally and `build_governed_tools` exposes only `perform_action`, while `start_governed_run` returns a `GovernedRunHandle` with a persistent `run_id`.

aligned

Apply progressive disclosure to system agency

The read side applies progressive disclosure cleanly: `status(...)` returns a compact `RunStatus` with state, latest message, final output, and failure reason; `timeline(...)` exposes detailed event data only when requested; and `verify_audit(...)` is a separate diagnostic operation. This separates primary operational understanding from deeper inspection.

aligned

Replace implied magic with clear mental models

The code replaces implied magic with explicit operating rules. `GovernedPolicy` makes scope, timeout, turn limits, and approval rules concrete; `SUBMISSION_CAPABLE_ACTIONS` names actions requiring approval; `build_governed_tools` tells the model the permitted action names; and `_build_governed_agent` sets `handoffs=[]`, making unsupported delegation explicit rather than implicit.

aligned

Expose meaningful operational state, not internal complexity

Operational state is represented with a closed `RunState` enum (`INITIALISED`, `IN_PROGRESS`, `AWAITING_APPROVAL`, `PAUSED`, `COMPLETE`, `FAILED`, `ABORTED_BY_USER`, `TIMED_OUT`) and an explicit `VALID_TRANSITIONS` graph. The persistence-layer `transition(...)` enforces legal transitions under `BEGIN IMMEDIATE`, and `status(...)` exposes user-relevant state rather than raw SDK internals.
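For readers unfamiliar with the pattern, a closed enum plus an explicit transition graph might look roughly like this. The state names come from the review; the edges in `VALID_TRANSITIONS` below are illustrative guesses, not the reviewed graph, and `IllegalTransition` is a hypothetical exception name. Persistence and the `BEGIN IMMEDIATE` locking are omitted.

```python
from enum import Enum


class RunState(Enum):
    INITIALISED = "INITIALISED"
    IN_PROGRESS = "IN_PROGRESS"
    AWAITING_APPROVAL = "AWAITING_APPROVAL"
    PAUSED = "PAUSED"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    ABORTED_BY_USER = "ABORTED_BY_USER"
    TIMED_OUT = "TIMED_OUT"


S = RunState

# Illustrative edges only: the actual graph is not reproduced in the report.
# States absent from the mapping are treated as terminal.
VALID_TRANSITIONS = {
    S.INITIALISED: {S.IN_PROGRESS, S.ABORTED_BY_USER},
    S.IN_PROGRESS: {
        S.AWAITING_APPROVAL, S.PAUSED, S.COMPLETE,
        S.FAILED, S.ABORTED_BY_USER, S.TIMED_OUT,
    },
    S.AWAITING_APPROVAL: {S.IN_PROGRESS, S.ABORTED_BY_USER, S.TIMED_OUT},
    S.PAUSED: {S.IN_PROGRESS, S.ABORTED_BY_USER, S.TIMED_OUT},
}


class IllegalTransition(Exception):
    """Hypothetical error for a move the graph does not allow."""


def transition(current: RunState, target: RunState) -> RunState:
    """Return the target state if the edge is legal, otherwise raise."""
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise IllegalTransition(f"{current.name} -> {target.name}")
    return target
```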

aligned

Make hand-offs, approvals, and blockers explicit

Approvals and blockers are explicit at the execution boundary. `GovernedPolicy._require_unconditional_approval_for_submission_capable_scope` rejects `click`, `submit`, or `keypress` scope without an unconditional `ApprovalGateRule`; `wait_for_decision` transitions to `RunState.AWAITING_APPROVAL` with action details and rule IDs; `resume` records approval, `abort` records decline/abort, and `ApprovalBindingMismatch` prevents execution if the approved action digest differs from the executable action. Unsupported handoffs are failed explicitly in `GovernedRunHooks.on_handoff`.
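The digest-binding check can be illustrated with a short sketch. `ApprovalBindingMismatch` is named in the review; `action_digest`, `check_binding`, and the canonical JSON encoding are assumptions, since the report does not show how the reviewed code computes its digests.

```python
import hashlib
import hmac
import json


class ApprovalBindingMismatch(Exception):
    """The approved digest does not match the action about to execute."""


def action_digest(action_name: str, action_payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding (an assumed encoding)."""
    canonical = json.dumps(
        {"action": action_name, "payload": action_payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_binding(approved_digest: str, action_name: str, action_payload: dict) -> str:
    """Recompute the digest at execution time and refuse on any difference."""
    actual = action_digest(action_name, action_payload)
    if not hmac.compare_digest(approved_digest, actual):
        raise ApprovalBindingMismatch(action_name)
    return actual
```

Recomputing the digest immediately before execution means an approval granted for one payload cannot be reused for a payload that changed afterwards.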

aligned

Represent delegated work as a system, not merely as a conversation

Delegated work is represented as a structured system, not merely a conversation. Persistence separates `runs`, `events`, and `steering_commands`; `RunState` separates execution state from chat output; `timeline(...)` exposes ordered operational history; and tools are mediated through `GovernedActionDispatcher` rather than free-form conversational instructions.

Embed in your README

Two embeddable variants: a small flat shield and a richer score card.

Score card (recommended)

[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/e78f225f-32ce-4426-a90b-4143212302be/card.svg)](https://aidesignblueprint.com/en/readiness-review/e78f225f-32ce-4426-a90b-4143212302be)

Flat badge

[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/e78f225f-32ce-4426-a90b-4143212302be.svg)](https://aidesignblueprint.com/en/readiness-review/e78f225f-32ce-4426-a90b-4143212302be)

Baseline and iteration details

Rubric: 2026-05-04


Run ID: e78f225f-32ce-4426-a90b-4143212302be · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.