Evaluation complete; awaiting evidence review.
Evaluated on 7 May 2026 against the AI Design Blueprint doctrine
Emerging
Status: High Risk
74/100
Grade C
This is an autonomous governed-agent workflow with several strong architectural primitives: a frozen typed policy envelope, durable run_id/state rows, a closed state graph, approval gates, steering commands, leases, and a redacted hash-chained audit trail with segregated evidence. It is not yet fully aligned because audit verification is still not bidirectional for all redacted evidence markers, and a lease-loss path during pause can let execution continue after the run has effectively been terminalized.
Iteration history
3 previous runs on this artifact. Each run_id opens its own readiness review.
Findings by principle
10 principles evaluated. Verdict, severity, evidence, and recommendation for each.
P0
Optimise for steering, not only initiating
The steering primitives mostly exist, but one lease-loss path can break the pause boundary. `abort_run`, `pause_run`, and `resume_run` enqueue durable steering commands, and `GovernedRunHooks._drain_steering` processes them at checkpoints. However, `GovernedRunHooks._wait_for_resume` catches `LeaseLost` from `heartbeat(...)` and simply `return`s. If a paused run is reaped or the lease is stolen while waiting, the hook can return to the SDK as though pause completed; if the pause happened at `on_tool_start`, execution can proceed into `perform_action` / `GovernedActionDispatcher.dispatch` after the watchdog has terminalized the run. `GovernedActionDispatcher.dispatch` appends tool events and…
Recommendation
Do not suppress `LeaseLost` in `_wait_for_resume`; propagate it or convert it to `AbortRequested` so `Runner.run` stops before the next model/tool step. Also place a lease/state fence immediately before `action_executor(action)`, ideally by moving external action execution behind a small lease-aware action service that refuses work unless the run is still non-terminal and owned by the expected worker.
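One possible shape for this recommendation is sketched below. It is not the package's code: the `runs` table layout (`state`, `lease_owner`), the `COMPLETED` state name, the `max_polls` bound, and the helper names `wait_for_resume` and `fenced_execute` are assumptions made for illustration. Only `LeaseLost`, `AbortRequested`, and `action_executor` are names taken from the finding.

```python
import sqlite3


class LeaseLost(Exception):
    """The worker no longer holds the run lease (name taken from the finding)."""


class AbortRequested(Exception):
    """The run must stop before the next model/tool step (name taken from the finding)."""


# Assumed set of terminal states; "COMPLETED" is a hypothetical name.
TERMINAL_STATES = {"FAILED", "TIMED_OUT", "ABORTED_BY_USER", "COMPLETED"}


def wait_for_resume(heartbeat, is_resumed, max_polls=100):
    """Poll until resumed; a lost lease becomes AbortRequested instead of a silent return."""
    for _ in range(max_polls):
        try:
            heartbeat()
        except LeaseLost as exc:
            # Propagate rather than `return`, so the caller cannot continue.
            raise AbortRequested("lease lost while paused") from exc
        if is_resumed():
            return
    raise AbortRequested("pause wait exceeded max_polls")


def fenced_execute(conn, run_id, worker_id, action, action_executor):
    """Refuse to run the action unless the run is non-terminal and owned by worker_id."""
    row = conn.execute(
        "SELECT state, lease_owner FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None or row[0] in TERMINAL_STATES or row[1] != worker_id:
        raise LeaseLost(f"run {run_id} is terminal or not owned by {worker_id}")
    return action_executor(action)
```

The fence here is a check followed by a call, so it narrows the window rather than closing it; a stricter variant would perform the check and the ledger append inside one transaction.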
P0
Establish trust through inspectability
The audit trail is strong but still not fully bidirectional. `append_event` redacts sensitive fields into ledger markers and `_persist_evidence` stores raw sidecar rows; `verify_audit` now recomputes `sha256(ev.value_json)` for each existing evidence row and requires policy evidence at `(seq=1, field='policy_json')`. However, `verify_audit` only iterates `for ev in evidence_rows` and never scans every `event.data` redaction marker to require a corresponding `(run_id, seq, field)` sidecar row. Deleting evidence for non-policy redacted fields such as `target_text`, `action_payload`, `result`, or terminal `final_output` can therefore leave the chain markers intact while `verify_audit` has no re…
Recommendation
Make `verify_audit` bidirectional: derive the required `(seq, field, digest)` set from every redacted marker in every ledger event, require exactly one matching evidence row for each required marker, recompute the raw evidence digest, fail on missing or duplicate sidecar rows, and separately fail on orphan evidence rows not committed by the chain.
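The bidirectional check described above could look roughly like the following sketch. The data shapes are assumptions, not the package's schema: ledger events are modelled as dicts with `seq` and `data`, a redaction marker is modelled as `{"redacted_sha256": <hex digest>}`, and evidence rows as dicts with `seq`, `field`, and `value_json`. The function name `verify_bidirectional` is hypothetical.

```python
import hashlib
from collections import Counter


def marker_digest(value_json: str) -> str:
    """SHA-256 hex digest of the raw evidence payload."""
    return hashlib.sha256(value_json.encode("utf-8")).hexdigest()


def verify_bidirectional(events, evidence_rows):
    """Return a list of problems; an empty list means ledger and sidecar agree."""
    # Forward direction: every redacted marker in the ledger requires a sidecar row.
    required = {}
    for ev in events:
        for field, value in ev["data"].items():
            if isinstance(value, dict) and "redacted_sha256" in value:
                required[(ev["seq"], field)] = value["redacted_sha256"]

    counts = Counter((row["seq"], row["field"]) for row in evidence_rows)
    problems = []
    for key in required:
        if counts[key] == 0:
            problems.append(f"missing evidence for {key}")
        elif counts[key] > 1:
            problems.append(f"duplicate evidence for {key}")

    # Reverse direction: every sidecar row must be committed by a marker and match it.
    for row in evidence_rows:
        key = (row["seq"], row["field"])
        if key not in required:
            problems.append(f"orphan evidence for {key}")
        elif marker_digest(row["value_json"]) != required[key]:
            problems.append(f"digest mismatch for {key}")
    return problems
```

Deriving the required set from the ledger first is what catches deleted evidence: a loop that only walks existing evidence rows cannot notice a row that is no longer there.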
P0
Design for delegation rather than direct manipulation
Delegation is represented as an explicit bounded-authority contract rather than free-form direct manipulation: `GovernedPolicy` captures `task`, `instructions`, `max_turns`, `timeout_seconds`, `permitted_action_scope`, and `approval_gates`; `start_governed_run` rejects non-`GovernedPolicy` inputs; `GovernedActionDispatcher.dispatch` enforces `policy.is_action_permitted(...)` before invoking the executor; and `ScopeViolation` terminates out-of-scope attempts. Users initiate work by assigning intent and constraints, while pause/resume/abort and approval gates govern execution.
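As an illustration of the bounded-authority contract described above, here is a minimal sketch using a frozen dataclass. The field names follow the finding; the field types, the standalone `dispatch` function, and the executor signature are assumptions and may differ from the real `GovernedPolicy` and `GovernedActionDispatcher.dispatch`.

```python
from dataclasses import dataclass, FrozenInstanceError


class ScopeViolation(Exception):
    """Raised when an action falls outside the permitted scope."""


@dataclass(frozen=True)
class GovernedPolicy:
    # Field names come from the finding; the types are assumed.
    task: str
    instructions: str
    max_turns: int
    timeout_seconds: float
    permitted_action_scope: frozenset
    approval_gates: tuple = ()

    def is_action_permitted(self, action_name: str) -> bool:
        return action_name in self.permitted_action_scope


def dispatch(policy, action_name, payload, action_executor):
    """Enforce the policy envelope before the executor is ever invoked."""
    if not isinstance(policy, GovernedPolicy):
        raise TypeError("a GovernedPolicy instance is required")
    if not policy.is_action_permitted(action_name):
        raise ScopeViolation(f"action {action_name!r} is outside the permitted scope")
    return action_executor(action_name, payload)
```

Freezing the dataclass means the envelope cannot be widened after the run starts, which fits the abort-and-restart correction model the review describes elsewhere.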
P0
Ensure that background work remains perceptible
Background execution is made perceptible through durable state and read-side surfaces: `create_run` inserts an `INITIALISED` row before the agent loop starts, `status(run_id)` exposes current state and latest message, `timeline(run_id)` exposes ordered events, `GovernedRunHooks._checkpoint` records heartbeat events, and `reap_stale_leases` transitions expired `INITIALISED`, `IN_PROGRESS`, `AWAITING_APPROVAL`, or `PAUSED` runs to `TIMED_OUT`. The run can be inspected after the initiating call returns because the run state is persisted in SQLite rather than only held in memory.
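The watchdog behaviour described above can be sketched against SQLite as follows. The `runs` table layout and the `lease_expires_at` column are assumptions for illustration; only the state names and the function name `reap_stale_leases` come from the finding, and the real implementation presumably also appends ledger events.

```python
import sqlite3

# Non-terminal states that the finding lists as reapable.
REAPABLE_STATES = ("INITIALISED", "IN_PROGRESS", "AWAITING_APPROVAL", "PAUSED")


def reap_stale_leases(conn: sqlite3.Connection, now: float) -> int:
    """Move every non-terminal run whose lease has expired to TIMED_OUT.

    Returns the number of runs that were reaped.
    """
    placeholders = ",".join("?" for _ in REAPABLE_STATES)
    cur = conn.execute(
        f"UPDATE runs SET state = 'TIMED_OUT' "
        f"WHERE state IN ({placeholders}) AND lease_expires_at < ?",
        (*REAPABLE_STATES, now),
    )
    conn.commit()
    return cur.rowcount
```

Because the state lives in a database row rather than in process memory, a separate process can run this reaper and a later `status(run_id)` call will see the result.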
P0
Align feedback with the user’s level of attention
Feedback is calibrated by attention level: `status(...)` returns a concise `RunStatus` for lightweight monitoring; `timeline(...)` provides detailed event history only when requested; and the CLI `_stream_attention_required` emits only material attention states such as `AWAITING_APPROVAL`, `PAUSED`, `FAILED`, `TIMED_OUT`, and `ABORTED_BY_USER`, deduplicated by `(kind, seq)`. Routine heartbeats are kept in the timeline, while blockers and terminal states are escalated to stderr in the foreground run command.
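The attention filter described above might be approximated as below. This is a sketch under assumptions: events are modelled as dicts whose `kind` equals the state name, and `attention_events` is a hypothetical helper rather than the CLI's `_stream_attention_required`.

```python
# States the finding lists as worth interrupting the user for.
ATTENTION_KINDS = {
    "AWAITING_APPROVAL",
    "PAUSED",
    "FAILED",
    "TIMED_OUT",
    "ABORTED_BY_USER",
}


def attention_events(events, seen=None):
    """Return only attention-worthy events, deduplicated by (kind, seq).

    Pass the same `seen` set across polls so repeated timeline reads
    do not re-emit an event that was already surfaced.
    """
    seen = set() if seen is None else seen
    surfaced = []
    for ev in events:
        key = (ev["kind"], ev["seq"])
        if ev["kind"] in ATTENTION_KINDS and key not in seen:
            seen.add(key)
            surfaced.append(ev)
    return surfaced
```

Heartbeats never pass the filter, so they stay in the timeline for anyone who asks, while blockers and terminal states are surfaced once each.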
P0
Apply progressive disclosure to system agency
The code uses progressive disclosure rather than dumping all internals into the primary surface. `status` exposes intent-relevant state, latest message, and terminal digests; `timeline` exposes the hash-chained operational ledger with sensitive fields redacted; `evidence` is a separate raw-payload sidecar; and `verify_audit` is an explicit diagnostic check. This cleanly separates summary, audit trail, and raw evidence access.
P0
Replace implied magic with clear mental models
The mental model is explicit in code and runtime structures: `RunState` is a closed enum with states like `INITIALISED`, `IN_PROGRESS`, `AWAITING_APPROVAL`, `PAUSED`, and terminal outcomes; `GovernedPolicy` declares permitted actions and approval gates; `build_governed_tools` describes the single `perform_action` tool and its permitted action names; and `UnsupportedHandoff` makes handoffs unsupported rather than implicit. The package also documents the correction model as abort-and-restart rather than live policy mutation.
P0
Expose meaningful operational state, not internal complexity
Operational state is exposed through meaningful lifecycle concepts instead of raw SDK internals. `RunStatus` surfaces `state`, `is_terminal`, `latest_message`, `final_output`, and `failure_reason`; `VALID_TRANSITIONS` constrains legal state movement; and CLI verbs map directly to user-relevant actions (`run`, `status`, `timeline`, `abort`, `pause`, `resume`, `reap`). Lower-level implementation details such as hash entries and evidence rows are reserved for `timeline`, `evidence`, and `verify_audit`.
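To make the closed state graph concrete, here is a small sketch of a `RunState` enum with a `VALID_TRANSITIONS` table. The state names other than `COMPLETED` appear in the review; `COMPLETED`, the specific edges, and the `transition` and `is_terminal` helpers are assumptions chosen for illustration, not the package's actual graph.

```python
from enum import Enum


class RunState(Enum):
    INITIALISED = "INITIALISED"
    IN_PROGRESS = "IN_PROGRESS"
    AWAITING_APPROVAL = "AWAITING_APPROVAL"
    PAUSED = "PAUSED"
    COMPLETED = "COMPLETED"  # hypothetical name for the success terminal state
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"
    ABORTED_BY_USER = "ABORTED_BY_USER"


_S = RunState
_STOP = frozenset({_S.FAILED, _S.TIMED_OUT, _S.ABORTED_BY_USER})

# Assumed edges: terminal states map to an empty set, so nothing leaves them.
VALID_TRANSITIONS = {
    _S.INITIALISED: _STOP | {_S.IN_PROGRESS},
    _S.IN_PROGRESS: _STOP | {_S.AWAITING_APPROVAL, _S.PAUSED, _S.COMPLETED},
    _S.AWAITING_APPROVAL: _STOP | {_S.IN_PROGRESS},
    _S.PAUSED: _STOP | {_S.IN_PROGRESS},
    _S.COMPLETED: frozenset(),
    _S.FAILED: frozenset(),
    _S.TIMED_OUT: frozenset(),
    _S.ABORTED_BY_USER: frozenset(),
}


class InvalidTransition(Exception):
    """Raised when a state change is not an edge in VALID_TRANSITIONS."""


def is_terminal(state: RunState) -> bool:
    return not VALID_TRANSITIONS[state]


def transition(current: RunState, target: RunState) -> RunState:
    if target not in VALID_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.name} -> {target.name}")
    return target
```

Keeping the graph in one table means an `is_terminal` flag can be derived from it instead of being maintained separately.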
P0
Make hand-offs, approvals, and blockers explicit
Approvals, blockers, and handoffs are explicit. `GovernedPolicy._require_unconditional_approval_for_submission_capable_scope` rejects `click`, `submit`, or `keypress` authority unless an unconditional `ApprovalGateRule` covers the action; `wait_for_decision` transitions the run to `AWAITING_APPROVAL` and waits for `resume` or `abort`; approval is bound to `compute_action_digest`; and `GovernedRunHooks.on_handoff` transitions to `FAILED` through `UnsupportedHandoff` rather than allowing hidden delegation. Out-of-scope actions are recorded as `scope_violation` and fail the run.
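Binding an approval to an action digest, as described above, can be illustrated with the sketch below. The canonical JSON encoding and the `approval_covers` helper are assumptions; the real `compute_action_digest` may hash different fields or use a different serialization.

```python
import hashlib
import hmac
import json


def compute_action_digest(action_name: str, payload: dict) -> str:
    """Digest over a canonical JSON form, so key order does not change the result."""
    canonical = json.dumps(
        {"action": action_name, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def approval_covers(approved_digest: str, action_name: str, payload: dict) -> bool:
    """True only if the approval was granted for exactly this action and payload."""
    actual = compute_action_digest(action_name, payload)
    return hmac.compare_digest(approved_digest, actual)
```

With this binding, an approval granted for one `submit` payload does not carry over if the agent changes the payload between the approval request and execution.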
P0
Represent delegated work as a system, not merely as a conversation
Delegated work is represented as a structured system, not only as a chat transcript. Persistence is split across `runs`, `events`, `evidence`, `steering_commands`, and `run_lease_history`; execution state is governed by `RunState` and `VALID_TRANSITIONS`; the tool seam is centralized in `build_governed_tools`; and observability is provided through `status`, `timeline`, `evidence`, `verify_audit`, and `reap_stale_runs`. Conversation/model output is separated from run lifecycle, audit, evidence, and steering state.
Iteration delta
Regressions (1)
Run ID: 37cc23be-e74b-40d4-8703-a9366ca98910 · Results expire after 90 days
Run by agents. Governed by humans. Validated by the AI Design Blueprint.