Reviewed

Assessment complete; awaiting evidence revision.

Agent Architecture Review, Validation snapshot

Evaluated 7 May 2026 against the AI Design Blueprint doctrine

Emerging

Status: High Risk

74/100

Grade C

8 aligned · 2 production blockers
Per-principle verdicts

This is an autonomous governed-agent workflow with several strong architectural primitives: a frozen typed policy envelope, durable run_id/state rows, a closed state graph, approval gates, steering commands, leases, and a redacted hash-chained audit trail with segregated evidence. It is not yet fully aligned because audit verification is still not bidirectional for all redacted evidence markers, and a lease-loss path during pause can let execution continue after the run has effectively been terminalized.

Iteration history

3 prior runs on this artifact. Each run_id opens its own readiness review.

| When | Score | Status | Run ID |
| --- | --- | --- | --- |
| 7 May 2026 (this run) | 74 / C | High Risk | 37cc23be |
| 7 May 2026 | 74 / C | High Risk | 2067531c |
| 7 May 2026 | 74 / C | High Risk | 742680ee |
| 7 May 2026 | 68 / C | High Risk | e78f225f |

Per-principle findings

10 principles evaluated. Verdict, severity, evidence and recommendation for each.

P0

Optimise for steering, not only initiating

needs changes · production blocker · 70/100

The steering primitives mostly exist, but one lease-loss path can break the pause boundary. `abort_run`, `pause_run`, and `resume_run` enqueue durable steering commands, and `GovernedRunHooks._drain_steering` processes them at checkpoints. However, `GovernedRunHooks._wait_for_resume` catches `LeaseLost` from `heartbeat(...)` and simply `return`s. If a paused run is reaped or the lease is stolen while waiting, the hook can return to the SDK as though pause completed; if the pause happened at `on_tool_start`, execution can proceed into `perform_action` / `GovernedActionDispatcher.dispatch` after the watchdog has terminalized the run. `GovernedActionDispatcher.dispatch` appends tool events and…

Recommendation

Do not suppress `LeaseLost` in `_wait_for_resume`; propagate it or convert it to `AbortRequested` so `Runner.run` stops before the next model/tool step. Also place a lease/state fence immediately before `action_executor(action)`—ideally by moving external action execution behind a small lease-aware action service that refuses work unless the run is still non-terminal and owned by the expected worker.
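
The recommendation above can be sketched roughly as follows. This is a minimal illustration, not the reviewed package's code: `LeaseView`, `wait_for_resume`, `fenced_execute`, and the `read_lease` callback are hypothetical names, and the exception classes are stand-ins that only share names with the ones cited in the finding.

```python
from dataclasses import dataclass
from typing import Any, Callable


class LeaseLost(Exception):
    """The worker no longer owns the run lease."""


class AbortRequested(Exception):
    """Stop the run before the next model or tool step."""


@dataclass(frozen=True)
class LeaseView:
    # Hypothetical snapshot of the run row; the real schema is not shown in the review.
    owner: str
    terminal: bool


def wait_for_resume(heartbeat: Callable[[], None], is_resumed: Callable[[], bool]) -> None:
    """Poll until resumed, converting lease loss into an abort rather than returning."""
    while not is_resumed():
        try:
            heartbeat()  # a real loop would also sleep between polls
        except LeaseLost as exc:
            raise AbortRequested("lease lost while paused") from exc


def fenced_execute(
    read_lease: Callable[[], LeaseView],
    worker_id: str,
    action_executor: Callable[[Any], Any],
    action: Any,
) -> Any:
    """Refuse external work unless the run is non-terminal and still owned by this worker."""
    lease = read_lease()
    if lease.terminal or lease.owner != worker_id:
        raise AbortRequested("run is terminal or lease is held by another worker")
    return action_executor(action)
```

A check followed by a separate execute is not atomic, so a fence like this narrows the window rather than closing it; a stricter variant would make the ownership check and the action commit part of the same transaction or service call.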

P0

Establish trust through inspectability

needs changes · production blocker · 55/100

The audit trail is strong but still not fully bidirectional. `append_event` redacts sensitive fields into ledger markers and `_persist_evidence` stores raw sidecar rows; `verify_audit` now recomputes `sha256(ev.value_json)` for each existing evidence row and requires policy evidence at `(seq=1, field='policy_json')`. However, `verify_audit` only iterates `for ev in evidence_rows` and never scans every `event.data` redaction marker to require a corresponding `(run_id, seq, field)` sidecar row. Deleting evidence for non-policy redacted fields such as `target_text`, `action_payload`, `result`, or terminal `final_output` can therefore leave the chain markers intact while `verify_audit` has no re…

Recommendation

Make `verify_audit` bidirectional: derive the required `(seq, field, digest)` set from every redacted marker in every ledger event, require exactly one matching evidence row for each required marker, recompute the raw evidence digest, fail on missing or duplicate sidecar rows, and separately fail on orphan evidence rows not committed by the chain.
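
One way to picture the bidirectional check is the sketch below. It assumes a hypothetical marker shape (`{"redacted": True, "sha256": ...}` inside `event.data`) and simplified `Event` and `Evidence` records; the real ledger and sidecar formats are not shown in this review, and `verify_bidirectional` is an illustrative name.

```python
import hashlib
from collections import Counter
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    seq: int
    data: dict


class Evidence(NamedTuple):
    seq: int
    field: str
    value_json: str


def digest(value_json: str) -> str:
    return hashlib.sha256(value_json.encode("utf-8")).hexdigest()


def verify_bidirectional(events: Iterable[Event], evidence_rows: Iterable[Evidence]) -> list:
    """Return a list of problems; an empty list means ledger and sidecar agree."""
    # Ledger -> sidecar: every redaction marker defines a required evidence row.
    required = {}
    for ev in events:
        for field, value in ev.data.items():
            if isinstance(value, dict) and value.get("redacted"):
                required[(ev.seq, field)] = value["sha256"]

    rows = list(evidence_rows)
    counts = Counter((row.seq, row.field) for row in rows)
    problems = []
    for key in sorted(required):
        if counts[key] == 0:
            problems.append(f"missing evidence {key}")
        elif counts[key] > 1:
            problems.append(f"duplicate evidence {key}")

    # Sidecar -> ledger: every row must be committed by a marker and match its digest.
    for row in rows:
        key = (row.seq, row.field)
        if key not in required:
            problems.append(f"orphan evidence {key}")
        elif digest(row.value_json) != required[key]:
            problems.append(f"digest mismatch {key}")
    return problems
```

Under these assumptions, deleting the sidecar row for a redacted `result` field surfaces as a `missing evidence` problem instead of passing silently.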

P0

Design for delegation rather than direct manipulation

aligned

Delegation is represented as an explicit bounded-authority contract rather than free-form direct manipulation: `GovernedPolicy` captures `task`, `instructions`, `max_turns`, `timeout_seconds`, `permitted_action_scope`, and `approval_gates`; `start_governed_run` rejects non-`GovernedPolicy` inputs; `GovernedActionDispatcher.dispatch` enforces `policy.is_action_permitted(...)` before invoking the executor; and `ScopeViolation` terminates out-of-scope attempts. Users initiate work by assigning intent and constraints, while pause/resume/abort and approval gates govern execution.
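
As a rough illustration of the bounded-authority contract described above, the sketch below uses a frozen dataclass and a scope check before execution. `PolicySketch` and `dispatch` are hypothetical, simplified stand-ins; they mirror only a few of the fields named in the finding and are not the package's `GovernedPolicy` or `GovernedActionDispatcher`.

```python
from dataclasses import dataclass
from typing import Any, Callable, FrozenSet


class ScopeViolation(Exception):
    """An action outside the permitted scope was attempted."""


@dataclass(frozen=True)
class PolicySketch:
    # Simplified stand-in; the real envelope also carries instructions,
    # timeout_seconds, and approval_gates according to the review.
    task: str
    max_turns: int
    permitted_action_scope: FrozenSet[str]

    def is_action_permitted(self, name: str) -> bool:
        return name in self.permitted_action_scope


def dispatch(policy: PolicySketch, action_name: str, executor: Callable[[str], Any]) -> Any:
    """Check the policy before invoking the executor, never after."""
    if not policy.is_action_permitted(action_name):
        raise ScopeViolation(action_name)
    return executor(action_name)
```

Freezing the dataclass means the authority granted at the start of a run cannot be widened by later attribute assignment; attempts raise `dataclasses.FrozenInstanceError`.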

P0

Ensure that background work remains perceptible

aligned

Background execution is made perceptible through durable state and read-side surfaces: `create_run` inserts an `INITIALISED` row before the agent loop starts, `status(run_id)` exposes current state and latest message, `timeline(run_id)` exposes ordered events, `GovernedRunHooks._checkpoint` records heartbeat events, and `reap_stale_leases` transitions expired `INITIALISED`, `IN_PROGRESS`, `AWAITING_APPROVAL`, or `PAUSED` runs to `TIMED_OUT`. The run can be inspected after the initiating call returns because the run state is persisted in SQLite rather than only held in memory.
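
The reaping behaviour described above might look something like the following against SQLite. The table layout (`runs` with `run_id`, `state`, `lease_expires_at`) and the function name `reap_stale` are assumptions made for illustration; only the four reapable states and the `TIMED_OUT` target come from the finding.

```python
import sqlite3

# Non-terminal states the review lists as eligible for reaping.
REAPABLE = ("INITIALISED", "IN_PROGRESS", "AWAITING_APPROVAL", "PAUSED")


def reap_stale(conn: sqlite3.Connection, now: float) -> int:
    """Move runs with an expired lease to TIMED_OUT and return how many were reaped."""
    placeholders = ",".join("?" * len(REAPABLE))
    cur = conn.execute(
        f"UPDATE runs SET state = 'TIMED_OUT' "
        f"WHERE state IN ({placeholders}) AND lease_expires_at < ?",
        (*REAPABLE, now),
    )
    conn.commit()
    return cur.rowcount
```

Because the state lives in a table rather than in process memory, a separate watchdog process can run this query while the original worker is unresponsive.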

P0

Align feedback with the user’s level of attention

aligned

Feedback is calibrated by attention level: `status(...)` returns a concise `RunStatus` for lightweight monitoring; `timeline(...)` provides detailed event history only when requested; and the CLI `_stream_attention_required` emits only material attention states such as `AWAITING_APPROVAL`, `PAUSED`, `FAILED`, `TIMED_OUT`, and `ABORTED_BY_USER`, deduplicated by `(kind, seq)`. Routine heartbeats are kept in the timeline, while blockers and terminal states are escalated to stderr in the foreground run command.
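
The attention filter described above can be approximated with a small generator. The event tuple shape `(kind, seq, message)` and the name `attention_required` are assumptions for illustration; the set of escalated kinds and the `(kind, seq)` deduplication key are taken from the finding.

```python
from typing import Iterable, Iterator, Set, Tuple

ATTENTION_KINDS = frozenset(
    {"AWAITING_APPROVAL", "PAUSED", "FAILED", "TIMED_OUT", "ABORTED_BY_USER"}
)


def attention_required(
    events: Iterable[Tuple[str, int, str]],
    seen: Set[Tuple[str, int]],
) -> Iterator[str]:
    """Yield one line per material event, skipping routine kinds and repeats."""
    for kind, seq, message in events:
        key = (kind, seq)
        if kind not in ATTENTION_KINDS or key in seen:
            continue
        seen.add(key)
        yield f"[{kind}] #{seq}: {message}"
```

Passing the same `seen` set across polls keeps a repeated read of the timeline from re-announcing an event that was already surfaced.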

P0

Apply progressive disclosure to system agency

aligned

The code uses progressive disclosure rather than dumping all internals into the primary surface. `status` exposes intent-relevant state, latest message, and terminal digests; `timeline` exposes the hash-chained operational ledger with sensitive fields redacted; `evidence` is a separate raw-payload sidecar; and `verify_audit` is an explicit diagnostic check. This cleanly separates summary, audit trail, and raw evidence access.

P0

Replace implied magic with clear mental models

aligned

The mental model is explicit in code and runtime structures: `RunState` is a closed enum with states like `INITIALISED`, `IN_PROGRESS`, `AWAITING_APPROVAL`, `PAUSED`, and terminal outcomes; `GovernedPolicy` declares permitted actions and approval gates; `build_governed_tools` describes the single `perform_action` tool and its permitted action names; and `UnsupportedHandoff` makes handoffs unsupported rather than implicit. The package also documents the correction model as abort-and-restart rather than live policy mutation.

P0

Expose meaningful operational state, not internal complexity

aligned

Operational state is exposed through meaningful lifecycle concepts instead of raw SDK internals. `RunStatus` surfaces `state`, `is_terminal`, `latest_message`, `final_output`, and `failure_reason`; `VALID_TRANSITIONS` constrains legal state movement; and CLI verbs map directly to user-relevant actions (`run`, `status`, `timeline`, `abort`, `pause`, `resume`, `reap`). Lower-level implementation details such as hash entries and evidence rows are reserved for `timeline`, `evidence`, and `verify_audit`.
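
A closed state graph of the kind described above could be sketched as an enum plus a transition map. The state names other than `COMPLETED` appear in this review; `COMPLETED` and the specific edges below are assumptions, since the package's actual `VALID_TRANSITIONS` table is not reproduced here.

```python
from enum import Enum


class RunState(Enum):
    INITIALISED = "INITIALISED"
    IN_PROGRESS = "IN_PROGRESS"
    AWAITING_APPROVAL = "AWAITING_APPROVAL"
    PAUSED = "PAUSED"
    COMPLETED = "COMPLETED"  # assumed name for the successful terminal state
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"
    ABORTED_BY_USER = "ABORTED_BY_USER"


_STOPS = {RunState.FAILED, RunState.TIMED_OUT, RunState.ABORTED_BY_USER}

# Assumed edges; terminal states have no entry and therefore no way out.
VALID_TRANSITIONS = {
    RunState.INITIALISED: {RunState.IN_PROGRESS} | _STOPS,
    RunState.IN_PROGRESS: {RunState.AWAITING_APPROVAL, RunState.PAUSED, RunState.COMPLETED} | _STOPS,
    RunState.AWAITING_APPROVAL: {RunState.IN_PROGRESS} | _STOPS,
    RunState.PAUSED: {RunState.IN_PROGRESS} | _STOPS,
}


def is_terminal(state: RunState) -> bool:
    return state not in VALID_TRANSITIONS


def transition(current: RunState, target: RunState) -> RunState:
    """Return the new state, or raise if the edge is not in the graph."""
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Deriving `is_terminal` from the absence of outgoing edges keeps the two definitions from drifting apart.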

P0

Make hand-offs, approvals, and blockers explicit

aligned

Approvals, blockers, and handoffs are explicit. `GovernedPolicy._require_unconditional_approval_for_submission_capable_scope` rejects `click`, `submit`, or `keypress` authority unless an unconditional `ApprovalGateRule` covers the action; `wait_for_decision` transitions the run to `AWAITING_APPROVAL` and waits for `resume` or `abort`; approval is bound to `compute_action_digest`; and `GovernedRunHooks.on_handoff` transitions to `FAILED` through `UnsupportedHandoff` rather than allowing hidden delegation. Out-of-scope actions are recorded as `scope_violation` and fail the run.
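
Binding an approval to an action digest, as described above, can be illustrated with a canonical JSON hash. This is a sketch under assumptions: `action_digest` and `approval_covers` are hypothetical names, and the package's `compute_action_digest` may canonicalise actions differently.

```python
import hashlib
import json


def action_digest(action: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def approval_covers(approved_digest: str, action: dict) -> bool:
    """An approval applies only to the exact action that was shown to the approver."""
    return approved_digest == action_digest(action)
```

With this shape, an approval granted for one payload does not carry over if the agent changes the target or arguments before executing.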

P0

Represent delegated work as a system, not merely as a conversation

aligned

Delegated work is represented as a structured system, not only as a chat transcript. Persistence is split across `runs`, `events`, `evidence`, `steering_commands`, and `run_lease_history`; execution state is governed by `RunState` and `VALID_TRANSITIONS`; the tool seam is centralized in `build_governed_tools`; and observability is provided through `status`, `timeline`, `evidence`, `verify_audit`, and `reap_stale_runs`. Conversation/model output is separated from run lifecycle, audit, evidence, and steering state.

Embed in your README

Two embeddable variants: a small flat shield and a richer score card.

Score card (recommended)

[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/37cc23be-e74b-40d4-8703-a9366ca98910/card.svg)](https://aidesignblueprint.com/en/readiness-review/37cc23be-e74b-40d4-8703-a9366ca98910)

Flat badge

[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/37cc23be-e74b-40d4-8703-a9366ca98910.svg)](https://aidesignblueprint.com/en/readiness-review/37cc23be-e74b-40d4-8703-a9366ca98910)
Baseline and iteration details
Baseline: used · Doctrine: same doctrine · Race: checked clear

Iteration delta

Regressions (1)

P10 · Optimise for steering, not only initiating · aligned → needs_changes

Rubric: 2026-05-04 · Grade limited by 0 high-risk findings

Run ID: 37cc23be-e74b-40d4-8703-a9366ca98910 · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.