Reviewed

Evaluation complete; awaiting evidence review.

Agent Architecture Review, Validation Snapshot

Evaluated on 7 May 2026 against the AI Design Blueprint doctrine

Emerging

Status: High Risk

74/100

Grade C

8 aligned · 2 production blockers

Verdicts by principle

This is an autonomous governed-agent workflow with several strong architectural primitives: a frozen typed policy envelope, durable run_id/state rows, a closed state graph, approval gates, steering commands, leases, and a redacted hash-chained audit trail with segregated evidence. It is not yet fully aligned because audit verification is still not bidirectional for all redacted evidence markers, and a lease-loss path during pause can let execution continue after the run has effectively been terminalized.

Iteration history

3 previous runs on this artifact. Each run_id opens its own readiness review.

When	Score	Status	Run ID
7 May 2026 (this run)	74 / C	High Risk	37cc23be…
7 May 2026	74 / C	High Risk	2067531c…
7 May 2026	74 / C	High Risk	742680ee…
7 May 2026	68 / C	High Risk	e78f225f…

Findings by principle

10 principles evaluated. Verdict, severity, evidence, and recommendation for each.

Optimise for steering, not only initiating

needs changes · production blocker · 70/100

The steering primitives mostly exist, but one lease-loss path can break the pause boundary. `abort_run`, `pause_run`, and `resume_run` enqueue durable steering commands, and `GovernedRunHooks._drain_steering` processes them at checkpoints. However, `GovernedRunHooks._wait_for_resume` catches `LeaseLost` from `heartbeat(...)` and simply `return`s. If a paused run is reaped or the lease is stolen while waiting, the hook can return to the SDK as though pause completed; if the pause happened at `on_tool_start`, execution can proceed into `perform_action` / `GovernedActionDispatcher.dispatch` after the watchdog has terminalized the run. `GovernedActionDispatcher.dispatch` appends tool events and…

Recommendation

Do not suppress `LeaseLost` in `_wait_for_resume`; propagate it or convert it to `AbortRequested` so `Runner.run` stops before the next model/tool step. Also place a lease/state fence immediately before `action_executor(action)`—ideally by moving external action execution behind a small lease-aware action service that refuses work unless the run is still non-terminal and owned by the expected worker.
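The recommendation above can be sketched as follows. This is a minimal illustration, not the reviewed package's code: the exception names come from the review, but `wait_for_resume`, `fenced_execute`, the `get_run` callback, the `lease_owner` field, and the set of terminal states are assumptions made for the example.

```python
import time


class LeaseLost(Exception):
    """The worker no longer owns the run lease."""


class AbortRequested(Exception):
    """Stop the run before the next model/tool step."""


# Assumption: only the terminal states named in the review are listed here.
TERMINAL_STATES = frozenset({"FAILED", "TIMED_OUT", "ABORTED_BY_USER"})


def wait_for_resume(heartbeat, is_resumed, poll_seconds=0.0):
    """Block while paused, converting a lost lease into an abort."""
    while not is_resumed():
        try:
            heartbeat()
        except LeaseLost as exc:
            # Do not return normally: the caller would treat the pause as
            # completed and continue into the next tool call.
            raise AbortRequested("lease lost while paused") from exc
        time.sleep(poll_seconds)


def fenced_execute(action, action_executor, get_run, worker_id):
    """Call the executor only if the run is non-terminal and owned by worker_id."""
    run = get_run()  # hypothetical read of the current run row
    if run["state"] in TERMINAL_STATES:
        raise AbortRequested(f"run is terminal: {run['state']}")
    if run["lease_owner"] != worker_id:
        raise LeaseLost(f"lease held by {run['lease_owner']!r}")
    return action_executor(action)
```

The fence here is a plain check-then-act, so a lease could still be lost between the check and the executor call. A production version would likely need the check to be atomic with the store, for example through a conditional update or a fencing token that the external action service validates.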

Establish trust through inspectability

needs changes · production blocker · 55/100

The audit trail is strong but still not fully bidirectional. `append_event` redacts sensitive fields into ledger markers and `_persist_evidence` stores raw sidecar rows; `verify_audit` now recomputes `sha256(ev.value_json)` for each existing evidence row and requires policy evidence at `(seq=1, field='policy_json')`. However, `verify_audit` only iterates `for ev in evidence_rows` and never scans every `event.data` redaction marker to require a corresponding `(run_id, seq, field)` sidecar row. Deleting evidence for non-policy redacted fields such as `target_text`, `action_payload`, `result`, or terminal `final_output` can therefore leave the chain markers intact while `verify_audit` has no re…

Recommendation

Make `verify_audit` bidirectional: derive the required `(seq, field, digest)` set from every redacted marker in every ledger event, require exactly one matching evidence row for each required marker, recompute the raw evidence digest, fail on missing or duplicate sidecar rows, and separately fail on orphan evidence rows not committed by the chain.
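A bidirectional check of this kind might look like the sketch below. It works on in-memory dictionaries for a single run, so `run_id` is omitted from the keys; the `redactions` mapping on each event, the row layout, and the function name `verify_evidence_bidirectional` are assumptions for illustration and do not reflect the package's actual schema.

```python
import hashlib
from collections import Counter


def sha256_hex(value_json: str) -> str:
    return hashlib.sha256(value_json.encode("utf-8")).hexdigest()


def verify_evidence_bidirectional(events, evidence_rows):
    """Return a list of (problem, (seq, field)) tuples; empty means consistent.

    events: dicts with "seq" and "redactions" ({field: committed digest}).
    evidence_rows: dicts with "seq", "field", and raw "value_json".
    """
    # Ledger -> evidence: every redaction marker requires one sidecar row.
    required = {
        (ev["seq"], field): digest
        for ev in events
        for field, digest in ev["redactions"].items()
    }
    counts = Counter((row["seq"], row["field"]) for row in evidence_rows)
    problems = []
    for key in sorted(required):
        if counts[key] == 0:
            problems.append(("missing", key))
        elif counts[key] > 1:
            problems.append(("duplicate", key))
    # Evidence -> ledger: every row must be committed and match its digest.
    for row in evidence_rows:
        key = (row["seq"], row["field"])
        if key not in required:
            problems.append(("orphan", key))
        elif sha256_hex(row["value_json"]) != required[key]:
            problems.append(("digest_mismatch", key))
    return problems
```

With this shape, deleting the sidecar row for a redacted `target_text` field yields a `missing` problem instead of passing silently, which is the gap the finding describes.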

Design for delegation rather than direct manipulation

aligned

Delegation is represented as an explicit bounded-authority contract rather than free-form direct manipulation: `GovernedPolicy` captures `task`, `instructions`, `max_turns`, `timeout_seconds`, `permitted_action_scope`, and `approval_gates`; `start_governed_run` rejects non-`GovernedPolicy` inputs; `GovernedActionDispatcher.dispatch` enforces `policy.is_action_permitted(...)` before invoking the executor; and `ScopeViolation` terminates out-of-scope attempts. Users initiate work by assigning intent and constraints, while pause/resume/abort and approval gates govern execution.
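To make the bounded-authority contract concrete, here is a small sketch of a frozen policy object with a scope check in front of the executor. The field names follow the review; the field types, the standalone `dispatch` function, and the executor signature are assumptions, and approval gates are carried but not evaluated in this example.

```python
from dataclasses import dataclass


class ScopeViolation(Exception):
    """An action outside the permitted scope was attempted."""


@dataclass(frozen=True)
class GovernedPolicy:
    task: str
    instructions: str
    max_turns: int
    timeout_seconds: float
    permitted_action_scope: frozenset
    approval_gates: tuple = ()

    def is_action_permitted(self, action_name: str) -> bool:
        return action_name in self.permitted_action_scope


def dispatch(policy, action_name, payload, executor):
    """Enforce the policy before the executor sees the action."""
    if not isinstance(policy, GovernedPolicy):
        raise TypeError("policy must be a GovernedPolicy")
    if not policy.is_action_permitted(action_name):
        raise ScopeViolation(action_name)
    return executor(action_name, payload)
```

Because the dataclass is frozen, the envelope cannot be mutated after the run starts, which matches the abort-and-restart correction model the review describes elsewhere.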

Ensure that background work remains perceptible

aligned

Background execution is made perceptible through durable state and read-side surfaces: `create_run` inserts an `INITIALISED` row before the agent loop starts, `status(run_id)` exposes current state and latest message, `timeline(run_id)` exposes ordered events, `GovernedRunHooks._checkpoint` records heartbeat events, and `reap_stale_leases` transitions expired `INITIALISED`, `IN_PROGRESS`, `AWAITING_APPROVAL`, or `PAUSED` runs to `TIMED_OUT`. The run can be inspected after the initiating call returns because the run state is persisted in SQLite rather than only held in memory.
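The persistence pattern described here can be sketched with the standard `sqlite3` module. The table layout, the `lease_expires_at` column, and the function signatures are assumptions for illustration; only the function names and state names come from the review.

```python
import sqlite3

# States the review says the reaper may move to TIMED_OUT.
REAPABLE = ("INITIALISED", "IN_PROGRESS", "AWAITING_APPROVAL", "PAUSED")


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE runs ("
        "run_id TEXT PRIMARY KEY, state TEXT NOT NULL, "
        "lease_expires_at REAL NOT NULL)"
    )
    return conn


def create_run(conn, run_id, now, lease_seconds):
    """Persist the run row before any agent work starts."""
    conn.execute(
        "INSERT INTO runs VALUES (?, 'INITIALISED', ?)",
        (run_id, now + lease_seconds),
    )
    conn.commit()


def status(conn, run_id):
    row = conn.execute(
        "SELECT state FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    return row[0] if row else None


def reap_stale_leases(conn, now):
    """Time out non-terminal runs whose lease has expired; return the count."""
    placeholders = ",".join("?" for _ in REAPABLE)
    cur = conn.execute(
        f"UPDATE runs SET state = 'TIMED_OUT' "
        f"WHERE state IN ({placeholders}) AND lease_expires_at < ?",
        (*REAPABLE, now),
    )
    conn.commit()
    return cur.rowcount
```

Since the row exists before the agent loop begins, a separate process can call `status` or run the reaper even if the initiating process has exited.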

Align feedback with the user’s level of attention

aligned

Feedback is calibrated by attention level: `status(...)` returns a concise `RunStatus` for lightweight monitoring; `timeline(...)` provides detailed event history only when requested; and the CLI `_stream_attention_required` emits only material attention states such as `AWAITING_APPROVAL`, `PAUSED`, `FAILED`, `TIMED_OUT`, and `ABORTED_BY_USER`, deduplicated by `(kind, seq)`. Routine heartbeats are kept in the timeline, while blockers and terminal states are escalated to stderr in the foreground run command.
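The deduplicated attention stream can be illustrated with a short generator. The event dictionary shape and the function name `attention_required` are assumptions; the set of escalated states and the `(kind, seq)` key are taken from the description above.

```python
ATTENTION_KINDS = frozenset(
    {"AWAITING_APPROVAL", "PAUSED", "FAILED", "TIMED_OUT", "ABORTED_BY_USER"}
)


def attention_required(events, seen):
    """Yield only material events not already surfaced to the user.

    events: iterable of dicts with "kind" and "seq".
    seen: a set of (kind, seq) keys, kept by the caller across polls.
    """
    for ev in events:
        key = (ev["kind"], ev["seq"])
        if ev["kind"] in ATTENTION_KINDS and key not in seen:
            seen.add(key)
            yield ev
```

A caller that polls the timeline repeatedly passes the same `seen` set each time, so an approval request is printed once while routine heartbeats never reach the foreground.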

Apply progressive disclosure to system agency

aligned

The code uses progressive disclosure rather than dumping all internals into the primary surface. `status` exposes intent-relevant state, latest message, and terminal digests; `timeline` exposes the hash-chained operational ledger with sensitive fields redacted; `evidence` is a separate raw-payload sidecar; and `verify_audit` is an explicit diagnostic check. This cleanly separates summary, audit trail, and raw evidence access.

Replace implied magic with clear mental models

aligned

The mental model is explicit in code and runtime structures: `RunState` is a closed enum with states like `INITIALISED`, `IN_PROGRESS`, `AWAITING_APPROVAL`, `PAUSED`, and terminal outcomes; `GovernedPolicy` declares permitted actions and approval gates; `build_governed_tools` describes the single `perform_action` tool and its permitted action names; and `UnsupportedHandoff` makes handoffs unsupported rather than implicit. The package also documents the correction model as abort-and-restart rather than live policy mutation.

Expose meaningful operational state, not internal complexity

aligned

Operational state is exposed through meaningful lifecycle concepts instead of raw SDK internals. `RunStatus` surfaces `state`, `is_terminal`, `latest_message`, `final_output`, and `failure_reason`; `VALID_TRANSITIONS` constrains legal state movement; and CLI verbs map directly to user-relevant actions (`run`, `status`, `timeline`, `abort`, `pause`, `resume`, `reap`). Lower-level implementation details such as hash entries and evidence rows are reserved for `timeline`, `evidence`, and `verify_audit`.
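A closed state graph of the kind referred to here could be sketched as below. The state names other than `COMPLETED` appear in the review; `COMPLETED` is a hypothetical name for the successful terminal state, and the specific edges in this `VALID_TRANSITIONS` table are assumptions, not the package's actual graph.

```python
from enum import Enum


class RunState(Enum):
    INITIALISED = "INITIALISED"
    IN_PROGRESS = "IN_PROGRESS"
    AWAITING_APPROVAL = "AWAITING_APPROVAL"
    PAUSED = "PAUSED"
    COMPLETED = "COMPLETED"  # hypothetical name, not given in the review
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"
    ABORTED_BY_USER = "ABORTED_BY_USER"


S = RunState
_STOPS = frozenset({S.FAILED, S.TIMED_OUT, S.ABORTED_BY_USER})

# Assumed edges; terminal states have no entry and therefore no exits.
VALID_TRANSITIONS = {
    S.INITIALISED: _STOPS | {S.IN_PROGRESS},
    S.IN_PROGRESS: _STOPS | {S.AWAITING_APPROVAL, S.PAUSED, S.COMPLETED},
    S.AWAITING_APPROVAL: _STOPS | {S.IN_PROGRESS},
    S.PAUSED: _STOPS | {S.IN_PROGRESS},
}


def is_terminal(state: RunState) -> bool:
    return not VALID_TRANSITIONS.get(state)


def transition(current: RunState, target: RunState) -> RunState:
    """Return the target state, or raise if the edge is not in the graph."""
    if target not in VALID_TRANSITIONS.get(current, frozenset()):
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the graph in one table means a terminalized run cannot be moved back to `IN_PROGRESS` by any code path that goes through `transition`.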

Make hand-offs, approvals, and blockers explicit

aligned

Approvals, blockers, and handoffs are explicit. `GovernedPolicy._require_unconditional_approval_for_submission_capable_scope` rejects `click`, `submit`, or `keypress` authority unless an unconditional `ApprovalGateRule` covers the action; `wait_for_decision` transitions the run to `AWAITING_APPROVAL` and waits for `resume` or `abort`; approval is bound to `compute_action_digest`; and `GovernedRunHooks.on_handoff` transitions to `FAILED` through `UnsupportedHandoff` rather than allowing hidden delegation. Out-of-scope actions are recorded as `scope_violation` and fail the run.
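Binding an approval to an action digest can be sketched as follows. The review names `compute_action_digest` but does not show its signature or canonicalization, so the canonical JSON form and the helper `approval_covers` below are assumptions chosen for the example.

```python
import hashlib
import hmac
import json


def compute_action_digest(action_name, payload):
    """Hash a canonical JSON form so that key order does not matter."""
    canonical = json.dumps(
        {"action": action_name, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def approval_covers(approved_digest, action_name, payload):
    """True only if the approval was granted for exactly this action."""
    current = compute_action_digest(action_name, payload)
    return hmac.compare_digest(approved_digest, current)
```

If the agent changes the payload after the human approves, the digests differ and the earlier approval no longer applies, so the action would have to go back through the gate.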

Represent delegated work as a system, not merely as a conversation

aligned

Delegated work is represented as a structured system, not only as a chat transcript. Persistence is split across `runs`, `events`, `evidence`, `steering_commands`, and `run_lease_history`; execution state is governed by `RunState` and `VALID_TRANSITIONS`; the tool seam is centralized in `build_governed_tools`; and observability is provided through `status`, `timeline`, `evidence`, `verify_audit`, and `reap_stale_runs`. Conversation/model output is separated from run lifecycle, audit, evidence, and steering state.

Add to your README

Two embeddable variants: a small badge and a richer card.

Score card (recommended)

[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/37cc23be-e74b-40d4-8703-a9366ca98910/card.svg)](https://aidesignblueprint.com/en/readiness-review/37cc23be-e74b-40d4-8703-a9366ca98910)

Flat badge

[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/37cc23be-e74b-40d4-8703-a9366ca98910.svg)](https://aidesignblueprint.com/en/readiness-review/37cc23be-e74b-40d4-8703-a9366ca98910)

Baseline and iteration details

Baseline: used · Doctrine: same doctrine · Race: checked clear

Iteration delta

Regressions (1)

P10 · Optimise for steering, not only initiating · aligned → needs_changes

Rubric: 2026-05-04 · Grade capped by 0 high-risk findings


Run ID: 37cc23be-e74b-40d4-8703-a9366ca98910 · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.