Reviewed

Assessment complete; awaiting evidence revision.

Agent Architecture Review, Validation snapshot

Evaluated 7 May 2026 against the AI Design Blueprint doctrine

Emerging

Status: High Risk

74/100

Grade C

8 aligned · 2 production blockers

Blueprint Readiness measures doctrine alignment, not runtime correctness. A production-ready verdict means the architecture embodies the 10 principles; it does not run your tests or types. Layer it on top of your test suite, not in place of it.

Per-principle verdicts

This is an autonomous governed-agent workflow with several strong architectural primitives: a frozen typed policy envelope, durable run_id/state rows, a closed state graph, approval gates, steering commands, leases, and a redacted hash-chained audit trail with segregated evidence. It is not yet fully aligned because audit verification is still not bidirectional for all redacted evidence markers, and a lease-loss path during pause can let execution continue after the run has effectively been terminalized.

Per-principle findings

10 principles evaluated. Verdict, severity, evidence and recommendation for each.

P10

needs changes · production blocker · 70/100

Optimise for steering, not only initiating

The steering primitives mostly exist, but one lease-loss path can break the pause boundary. `abort_run`, `pause_run`, and `resume_run` enqueue durable steering commands, and `GovernedRunHooks._drain_steering` processes them at checkpoints. However, `GovernedRunHooks._wait_for_resume` catches `LeaseLost` from `heartbeat(...)` and simply `return`s. If a paused run is reaped or the lease is stolen while waiting, the hook can return to the SDK as though pause completed; if the pause happened at `on_tool_start`, execution can proceed into `perform_action` / `GovernedActionDispatcher.dispatch` after the watchdog has terminalized the run. `GovernedActionDispatcher.dispatch` appends tool events and…

Recommendation

Do not suppress `LeaseLost` in `_wait_for_resume`; propagate it or convert it to `AbortRequested` so `Runner.run` stops before the next model/tool step. Also place a lease/state fence immediately before `action_executor(action)`—ideally by moving external action execution behind a small lease-aware action service that refuses work unless the run is still non-terminal and owned by the expected worker.
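The recommendation above can be sketched as follows. This is a minimal illustration, not the reviewed package's code: `FakeLeaseStore`, `wait_for_resume`, `fenced_execute`, and the `TERMINAL_STATES` set are hypothetical stand-ins, and only the exception names `LeaseLost` and `AbortRequested` come from the finding.

```python
import time
from dataclasses import dataclass


class LeaseLost(Exception):
    """Raised when the worker no longer owns the run's lease."""


class AbortRequested(Exception):
    """Signals the runner to stop before the next model/tool step."""


# Assumed subset of terminal states; the real enum may contain more.
TERMINAL_STATES = frozenset({"FAILED", "TIMED_OUT", "ABORTED_BY_USER"})


@dataclass
class FakeLeaseStore:
    """Hypothetical in-memory stand-in for the durable lease/state rows."""
    owner: str
    state: str = "PAUSED"

    def heartbeat(self, worker_id: str) -> None:
        if self.owner != worker_id:
            raise LeaseLost(f"lease held by {self.owner!r}, not {worker_id!r}")

    def assert_fenced(self, worker_id: str) -> None:
        # Fence: refuse work unless the run is non-terminal and still ours.
        if self.state in TERMINAL_STATES or self.owner != worker_id:
            raise LeaseLost(f"run is {self.state}, owner={self.owner!r}")


def wait_for_resume(store, worker_id, resumed, poll_seconds=0.0, max_polls=100):
    """Poll until resumed; turn lease loss into an abort instead of returning."""
    for _ in range(max_polls):
        try:
            store.heartbeat(worker_id)
        except LeaseLost as exc:
            # Do not swallow the error: the pause boundary must not be crossed.
            raise AbortRequested("lease lost while paused") from exc
        if resumed():
            return
        time.sleep(poll_seconds)
    raise AbortRequested("pause exceeded polling budget")


def fenced_execute(store, worker_id, action_executor, action):
    """Check the lease/state fence immediately before the external action."""
    store.assert_fenced(worker_id)
    return action_executor(action)
```

With this shape, a reaped or stolen lease surfaces as an exception on both paths, so neither the pause hook nor the action call can proceed silently after the run has been terminalized.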

needs changes · production blocker · 55/100

Establish trust through inspectability

The audit trail is strong but still not fully bidirectional. `append_event` redacts sensitive fields into ledger markers and `_persist_evidence` stores raw sidecar rows; `verify_audit` now recomputes `sha256(ev.value_json)` for each existing evidence row and requires policy evidence at `(seq=1, field='policy_json')`. However, `verify_audit` only iterates `for ev in evidence_rows` and never scans every `event.data` redaction marker to require a corresponding `(run_id, seq, field)` sidecar row. Deleting evidence for non-policy redacted fields such as `target_text`, `action_payload`, `result`, or terminal `final_output` can therefore leave the chain markers intact while `verify_audit` has no re…

Recommendation

Make `verify_audit` bidirectional: derive the required `(seq, field, digest)` set from every redacted marker in every ledger event, require exactly one matching evidence row for each required marker, recompute the raw evidence digest, fail on missing or duplicate sidecar rows, and separately fail on orphan evidence rows not committed by the chain.
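One way to read this recommendation is sketched below. The marker shape (`{"redacted": True, "sha256": ...}`), the tuple layouts for events and evidence rows, and the function names are assumptions made for illustration; the actual ledger and sidecar schemas are not shown in the review.

```python
import hashlib
from collections import Counter


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def redacted(value_json: str) -> dict:
    """Hypothetical marker left in the ledger in place of a raw value."""
    return {"redacted": True, "sha256": sha256_hex(value_json)}


def verify_evidence_bidirectional(events, evidence_rows):
    """Return a list of problems; an empty list means both directions agree.

    events: iterable of (seq, data), where data maps field -> value or marker.
    evidence_rows: iterable of (seq, field, value_json) sidecar rows.
    """
    evidence_rows = list(evidence_rows)  # iterated twice below
    problems = []

    # Direction 1: every redaction marker in the ledger defines a requirement.
    required = {}
    for seq, data in events:
        for field, value in data.items():
            if isinstance(value, dict) and value.get("redacted") is True:
                required[(seq, field)] = value["sha256"]

    counts = Counter((seq, field) for seq, field, _ in evidence_rows)
    for key in required:
        if counts[key] == 0:
            problems.append(f"missing evidence for {key}")
        elif counts[key] > 1:
            problems.append(f"duplicate evidence for {key}")

    # Direction 2: every evidence row must be committed by the chain and match.
    for seq, field, value_json in evidence_rows:
        key = (seq, field)
        if key not in required:
            problems.append(f"orphan evidence for {key}")
        elif sha256_hex(value_json) != required[key]:
            problems.append(f"digest mismatch for {key}")
    return problems
```

The key difference from a one-directional check is that the required set is derived from the ledger, so deleting a sidecar row produces a "missing evidence" failure rather than simply shrinking the set of rows that get checked.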

aligned

Design for delegation rather than direct manipulation

Delegation is represented as an explicit bounded-authority contract rather than free-form direct manipulation: `GovernedPolicy` captures `task`, `instructions`, `max_turns`, `timeout_seconds`, `permitted_action_scope`, and `approval_gates`; `start_governed_run` rejects non-`GovernedPolicy` inputs; `GovernedActionDispatcher.dispatch` enforces `policy.is_action_permitted(...)` before invoking the executor; and `ScopeViolation` terminates out-of-scope attempts. Users initiate work by assigning intent and constraints, while pause/resume/abort and approval gates govern execution.
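As a rough illustration of the bounded-authority contract described above, the sketch below uses a frozen dataclass. The field names follow the finding, but `GovernedPolicySketch`, the field types, and the `dispatch` helper are assumptions, not the package's actual definitions.

```python
from dataclasses import dataclass


class ScopeViolation(Exception):
    """Raised when the agent attempts an action outside the permitted scope."""


@dataclass(frozen=True)
class GovernedPolicySketch:
    """Hypothetical stand-in for GovernedPolicy; types are assumed."""
    task: str
    instructions: str
    max_turns: int
    timeout_seconds: float
    permitted_action_scope: frozenset
    approval_gates: tuple = ()

    def is_action_permitted(self, action_name: str) -> bool:
        return action_name in self.permitted_action_scope


def dispatch(policy, action_name, executor):
    """Enforce the policy envelope before the executor is ever invoked."""
    if not isinstance(policy, GovernedPolicySketch):
        raise TypeError("a typed policy is required; free-form input is rejected")
    if not policy.is_action_permitted(action_name):
        raise ScopeViolation(f"{action_name!r} is outside the permitted scope")
    return executor(action_name)
```

Because the dataclass is frozen, the envelope cannot be widened by attribute assignment once the run has started, which matches the abort-and-restart correction model the review describes elsewhere.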

aligned

Ensure that background work remains perceptible

Background execution is made perceptible through durable state and read-side surfaces: `create_run` inserts an `INITIALISED` row before the agent loop starts, `status(run_id)` exposes current state and latest message, `timeline(run_id)` exposes ordered events, `GovernedRunHooks._checkpoint` records heartbeat events, and `reap_stale_leases` transitions expired `INITIALISED`, `IN_PROGRESS`, `AWAITING_APPROVAL`, or `PAUSED` runs to `TIMED_OUT`. The run can be inspected after the initiating call returns because the run state is persisted in SQLite rather than only held in memory.
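The reaping behaviour described above might look roughly like the following. The `runs` table layout, the `lease_expires_at` column, and the `reap_stale` function are hypothetical; only the state names and the use of SQLite come from the finding.

```python
import sqlite3

# Assumed minimal schema; the real table certainly carries more columns.
SCHEMA = """
CREATE TABLE runs (
    run_id TEXT PRIMARY KEY,
    state TEXT NOT NULL,
    lease_expires_at REAL NOT NULL
)
"""

# Non-terminal states the finding lists as eligible for reaping.
REAPABLE = ("INITIALISED", "IN_PROGRESS", "AWAITING_APPROVAL", "PAUSED")


def reap_stale(conn: sqlite3.Connection, now: float) -> int:
    """Move runs with an expired lease to TIMED_OUT; return how many moved."""
    placeholders = ",".join("?" for _ in REAPABLE)
    cur = conn.execute(
        f"UPDATE runs SET state = 'TIMED_OUT' "
        f"WHERE state IN ({placeholders}) AND lease_expires_at < ?",
        (*REAPABLE, now),
    )
    conn.commit()
    return cur.rowcount
```

Filtering on the state inside the `UPDATE` keeps the check and the write in one statement, so an already-terminal run is never overwritten by the reaper.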

aligned

Align feedback with the user’s level of attention

Feedback is calibrated by attention level: `status(...)` returns a concise `RunStatus` for lightweight monitoring; `timeline(...)` provides detailed event history only when requested; and the CLI `_stream_attention_required` emits only material attention states such as `AWAITING_APPROVAL`, `PAUSED`, `FAILED`, `TIMED_OUT`, and `ABORTED_BY_USER`, deduplicated by `(kind, seq)`. Routine heartbeats are kept in the timeline, while blockers and terminal states are escalated to stderr in the foreground run command.
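The attention filter can be illustrated with a small generator. The event dictionaries, the `HEARTBEAT` kind, and the `attention_required` name are assumptions for the sketch; the set of escalated states and the `(kind, seq)` deduplication key are taken from the finding.

```python
# States the finding lists as material enough to escalate.
ATTENTION_KINDS = frozenset(
    {"AWAITING_APPROVAL", "PAUSED", "FAILED", "TIMED_OUT", "ABORTED_BY_USER"}
)


def attention_required(events, seen=None):
    """Yield only attention-worthy events, each at most once per (kind, seq)."""
    seen = set() if seen is None else seen
    for event in events:
        key = (event["kind"], event["seq"])
        if event["kind"] in ATTENTION_KINDS and key not in seen:
            seen.add(key)
            yield event
```

Passing the same `seen` set across polls lets a foreground command re-read the whole timeline repeatedly without re-announcing a blocker it has already reported.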

aligned

Apply progressive disclosure to system agency

The code uses progressive disclosure rather than dumping all internals into the primary surface. `status` exposes intent-relevant state, latest message, and terminal digests; `timeline` exposes the hash-chained operational ledger with sensitive fields redacted; `evidence` is a separate raw-payload sidecar; and `verify_audit` is an explicit diagnostic check. This cleanly separates summary, audit trail, and raw evidence access.

aligned

Replace implied magic with clear mental models

The mental model is explicit in code and runtime structures: `RunState` is a closed enum with states like `INITIALISED`, `IN_PROGRESS`, `AWAITING_APPROVAL`, `PAUSED`, and terminal outcomes; `GovernedPolicy` declares permitted actions and approval gates; `build_governed_tools` describes the single `perform_action` tool and its permitted action names; and `UnsupportedHandoff` makes handoffs unsupported rather than implicit. The package also documents the correction model as abort-and-restart rather than live policy mutation.

aligned

Expose meaningful operational state, not internal complexity

Operational state is exposed through meaningful lifecycle concepts instead of raw SDK internals. `RunStatus` surfaces `state`, `is_terminal`, `latest_message`, `final_output`, and `failure_reason`; `VALID_TRANSITIONS` constrains legal state movement; and CLI verbs map directly to user-relevant actions (`run`, `status`, `timeline`, `abort`, `pause`, `resume`, `reap`). Lower-level implementation details such as hash entries and evidence rows are reserved for `timeline`, `evidence`, and `verify_audit`.
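A closed state graph of this kind could be sketched as below. The state names other than `COMPLETED` appear in the review; `COMPLETED`, the specific edges in `VALID_TRANSITIONS`, and the `IllegalTransition` and `transition` helpers are assumptions chosen only to make the example concrete.

```python
from enum import Enum


class RunState(Enum):
    INITIALISED = "INITIALISED"
    IN_PROGRESS = "IN_PROGRESS"
    AWAITING_APPROVAL = "AWAITING_APPROVAL"
    PAUSED = "PAUSED"
    COMPLETED = "COMPLETED"  # assumed name for the success outcome
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"
    ABORTED_BY_USER = "ABORTED_BY_USER"


S = RunState
TERMINAL = frozenset({S.COMPLETED, S.FAILED, S.TIMED_OUT, S.ABORTED_BY_USER})
_STOPS = frozenset({S.FAILED, S.TIMED_OUT, S.ABORTED_BY_USER})

# Assumed edges; terminal states deliberately have no outgoing transitions.
VALID_TRANSITIONS = {
    S.INITIALISED: _STOPS | {S.IN_PROGRESS},
    S.IN_PROGRESS: _STOPS | {S.AWAITING_APPROVAL, S.PAUSED, S.COMPLETED},
    S.AWAITING_APPROVAL: _STOPS | {S.IN_PROGRESS},
    S.PAUSED: _STOPS | {S.IN_PROGRESS},
}


class IllegalTransition(Exception):
    """Raised when a move is not an edge of the closed state graph."""


def transition(current: RunState, target: RunState) -> RunState:
    if target not in VALID_TRANSITIONS.get(current, frozenset()):
        raise IllegalTransition(f"{current.name} -> {target.name}")
    return target


def is_terminal(state: RunState) -> bool:
    return state in TERMINAL
```

Keeping the graph as data rather than scattered `if` checks means a single lookup decides every state change, and a terminal run cannot be moved again by any caller.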

aligned

Make hand-offs, approvals, and blockers explicit

Approvals, blockers, and handoffs are explicit. `GovernedPolicy._require_unconditional_approval_for_submission_capable_scope` rejects `click`, `submit`, or `keypress` authority unless an unconditional `ApprovalGateRule` covers the action; `wait_for_decision` transitions the run to `AWAITING_APPROVAL` and waits for `resume` or `abort`; approval is bound to `compute_action_digest`; and `GovernedRunHooks.on_handoff` transitions to `FAILED` through `UnsupportedHandoff` rather than allowing hidden delegation. Out-of-scope actions are recorded as `scope_violation` and fail the run.
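Binding an approval to an action digest can be illustrated as follows. The review does not show how `compute_action_digest` canonicalises an action, so the sorted-key JSON encoding, the `_sketch` function, and `approval_covers` are assumptions.

```python
import hashlib
import hmac
import json


def compute_action_digest_sketch(action: dict) -> str:
    """Digest over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def approval_covers(approved_digest: str, action: dict) -> bool:
    """An approval is valid only for the exact action that was reviewed."""
    return hmac.compare_digest(
        approved_digest, compute_action_digest_sketch(action)
    )
```

Under this scheme, an approval granted for one payload does not carry over if the agent changes any field of the action before executing it, because the recomputed digest no longer matches.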

aligned

Represent delegated work as a system, not merely as a conversation

Delegated work is represented as a structured system, not only as a chat transcript. Persistence is split across `runs`, `events`, `evidence`, `steering_commands`, and `run_lease_history`; execution state is governed by `RunState` and `VALID_TRANSITIONS`; the tool seam is centralized in `build_governed_tools`; and observability is provided through `status`, `timeline`, `evidence`, `verify_audit`, and `reap_stale_runs`. Conversation/model output is separated from run lifecycle, audit, evidence, and steering state.

Embed in your README

Two embeddable variants: a small flat shield and a richer score card.

Score card (recommended)

[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/37cc23be-e74b-40d4-8703-a9366ca98910/card.svg)](https://aidesignblueprint.com/en/readiness-review/37cc23be-e74b-40d4-8703-a9366ca98910)

Flat badge

[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/37cc23be-e74b-40d4-8703-a9366ca98910.svg)](https://aidesignblueprint.com/en/readiness-review/37cc23be-e74b-40d4-8703-a9366ca98910)

Baseline and iteration details

Baseline: used · Doctrine: same doctrine · Race: checked clear

Iteration delta

0 closed this pass · 1 reopened · 0 high-risk findings still open

Regressions (1)

P10 · Optimise for steering, not only initiating · aligned → needs_changes

Rubric: 2026-05-04 · Grade limited by 0 high-risk findings


Run ID: 37cc23be-e74b-40d4-8703-a9366ca98910 · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.