Reviewed

Assessment complete; awaiting evidence review.

Agent Architecture Review, Validation Snapshot

Evaluated on 10 May 2026 against the AI Design Blueprint doctrine

High Risk

Status: High Risk

30/100

Grade F

3 aligned · 7 production blockers · 2 high risk

Verdicts by principle

The submission is clearly an autonomous payment workflow and includes several strong primitives: HMAC-bound approval envelopes, explicit run/task state, a durable inbox, a hash-chained ledger, explicit BANK_MODE handling, SUBMITTED_UNKNOWN, and separate SIMULATED_SUCCEEDED state. However, a load-bearing payment constraint still fails: newly created SUBMITTED_UNKNOWN bank handoffs are not counted against max_run_total_pence during the same execution pass, so the workflow can continue submitting additional payments even though the bank may already have accepted earlier ones. Reconciliation and bank-error semantics also remain incomplete for ambiguous transfer/status failures.

Iteration history

5 previous runs on this artifact. Each run_id opens its own readiness review.

When	Score	Status	Run ID
10 May 2026 (this run)	30 / F	High Risk	5c1d5833…
10 May 2026	40 / D	High Risk	b1195c34…
10 May 2026	74 / C	Aligned	dd3a9348…
10 May 2026	74 / C	Aligned	b4799966…
10 May 2026	74 / C	Aligned	7e9bc0f6…
10 May 2026	60 / C	High Risk	15aa9649…

Findings by principle

10 principles evaluated. Verdict, severity, evidence, and recommendation for each.

Make hand-offs, approvals, and blockers explicit

high risk · production blocker · 92/100

Approvals are explicit and strong: `ApprovalEnvelope` is HMAC-bound to `run_id`, `task_id`, `policy_hash`, `invoice_snapshot_hash`, approver identity/role, and expiry, and `_submit_one()` refuses mock or unconfirmed responses as real success. The critical handoff failure is after the bank boundary: when `_submit_one()` returns `SUBMITTED_UNKNOWN`, execution does not stop and `run_total_submitted` is not incremented, so later approved tasks can still be submitted even though an earlier transfer may already be accepted by the bank. Also, `BankAPIError('transient'|'other')` is treated as `FAILED` rather than an explicit unknown-after-send blocker, and `reconcile_submitted_task()` lacks a broad…

Recommendation

Introduce a hard state break at the bank boundary: after any ambiguous submit result, record the task as `SUBMITTED_UNKNOWN`, reserve/count its amount against the run cap, append a blocker event, notify the inbox, and stop further submissions until reconciliation. Move both transfer and status checks behind a typed bank service boundary whose result envelope distinguishes `not_sent`, `rejected`, `accepted_unconfirmed`, `confirmed`, and `unknown_after_send`.
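The typed result envelope named in the recommendation above could look roughly like the following. This is a minimal sketch, not the submission's code: `BankOutcome`, `BankResult`, and the two helper properties are hypothetical names, and only the five outcome values come from the review text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BankOutcome(Enum):
    # The five outcomes listed in the recommendation.
    NOT_SENT = "not_sent"
    REJECTED = "rejected"
    ACCEPTED_UNCONFIRMED = "accepted_unconfirmed"
    CONFIRMED = "confirmed"
    UNKNOWN_AFTER_SEND = "unknown_after_send"


# Outcomes where the bank may already hold the payment instruction.
MAY_HAVE_REACHED_BANK = frozenset({
    BankOutcome.ACCEPTED_UNCONFIRMED,
    BankOutcome.CONFIRMED,
    BankOutcome.UNKNOWN_AFTER_SEND,
})

# Outcomes that should halt further submissions until reconciliation.
AMBIGUOUS = frozenset({
    BankOutcome.ACCEPTED_UNCONFIRMED,
    BankOutcome.UNKNOWN_AFTER_SEND,
})


@dataclass(frozen=True)
class BankResult:
    """Hypothetical envelope returned by a typed bank service boundary."""
    outcome: BankOutcome
    amount_pence: int
    bank_reference: Optional[str] = None
    detail: str = ""

    @property
    def consumes_budget(self) -> bool:
        # Count the amount against the run cap whenever money may have moved.
        return self.outcome in MAY_HAVE_REACHED_BANK

    @property
    def blocks_further_submissions(self) -> bool:
        # Ambiguous results are a hard stop, not a soft warning.
        return self.outcome in AMBIGUOUS
```

With an envelope like this, the caller never has to infer exposure from an exception type: both the budget decision and the stop decision are read directly from the result.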

Design for delegation rather than direct manipulation

high risk · production blocker · 90/100

The workflow supports delegation through `Policy`, `create_run()`, `draft_tasks()`, signed `ApprovalEnvelope`, and `execute_approved_payments()`, but the delegated constraint `Policy.max_run_total_pence` is not reliably enforced once a bank handoff becomes uncertain. `execute_approved_payments()` initializes `run_total_submitted` from existing `SUBMITTED`, `SUBMITTED_UNKNOWN`, and `SUCCEEDED` tasks, but after `_submit_one()` it increments the total only when `task.status == TaskStatus.SUCCEEDED`; a newly created `SUBMITTED_UNKNOWN` task is not counted before the loop proceeds to the next approved invoice. Because `SUBMITTED_UNKNOWN` explicitly means the bank may have accepted the request, th…

Recommendation

Treat every current-pass bank handoff that may have reached the bank—`SUBMITTED_UNKNOWN`, unconfirmed non-mock responses, and ambiguous transfer errors—as consuming the run budget immediately, and preferably pause/stop the execution loop until reconciliation. Architecturally, move the bank handoff behind a typed service result such as `not_sent`, `accepted_unconfirmed`, `confirmed`, `rejected`, or `unknown_after_send`, and enforce `max_run_total_pence` against the worst-case exposure, not only confirmed success.
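One way to enforce the cap against worst-case exposure is to recompute the exposed total from the task list before every submission instead of maintaining a counter. The sketch below is a simplified stand-in for `execute_approved_payments()`: the `Task` shape, the `APPROVED` status, the `execute` function, and its string return values are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class TaskStatus(Enum):
    APPROVED = "approved"  # hypothetical: approved and not yet submitted
    SUBMITTED = "submitted"
    SUBMITTED_UNKNOWN = "submitted_unknown"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Statuses that count against the cap because the bank may hold the transfer.
EXPOSED = {TaskStatus.SUBMITTED, TaskStatus.SUBMITTED_UNKNOWN, TaskStatus.SUCCEEDED}


@dataclass
class Task:
    task_id: str
    amount_pence: int
    status: TaskStatus = TaskStatus.APPROVED


def worst_case_exposure(tasks: List[Task]) -> int:
    """Sum every amount the bank may already have accepted."""
    return sum(t.amount_pence for t in tasks if t.status in EXPOSED)


def execute(tasks: List[Task], max_run_total_pence: int,
            submit_one: Callable[[Task], TaskStatus]) -> str:
    for task in tasks:
        if task.status is not TaskStatus.APPROVED:
            continue
        # Re-derive exposure from the full projection on every iteration.
        if worst_case_exposure(tasks) + task.amount_pence > max_run_total_pence:
            return "cap_reached"
        task.status = submit_one(task)
        if task.status is TaskStatus.SUBMITTED_UNKNOWN:
            # Hard stop: nothing else is submitted until reconciliation.
            return "blocked_awaiting_reconciliation"
    return "completed"
```

Because exposure is derived rather than counted, a task that becomes `SUBMITTED_UNKNOWN` in the current pass is included in the very next cap check, even if the loop were allowed to continue.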

Expose meaningful operational state, not internal complexity

needs changes · production blocker · 65/100

The state model is mostly user-meaningful: `AWAITING_APPROVAL`, `PAUSED`, `SUBMITTED_UNKNOWN`, `SUCCEEDED`, `SIMULATED_SUCCEEDED`, `FAILED`, and `CANCELLED` are explicit, and `inspect_run()` exposes counts and task-level status. The remaining state gap is load-bearing: `_submit_one()` maps all `BankAPIError` values to `FAILED`, despite the declared `transient`/`other` categories being operationally different from a definite rejection. In addition, `execute_approved_payments()` does not treat a newly produced `SUBMITTED_UNKNOWN` as submitted exposure for the run-total projection, so the run’s operational state can understate money at risk during the same pass. Delta: this improves the prior P…

Recommendation

Make the operational state machine distinguish `definitely_failed_before_send`, `rejected_by_bank`, `accepted_unconfirmed`, and `unknown_after_send`. Update the execution projection immediately after each task so `SUBMITTED_UNKNOWN` contributes to exposure and blocks or pauses further submissions until reconciliation.
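The four states named in the recommendation can be expressed as an explicit transition table, so that an ambiguous task cannot silently move back into a retryable state. The following is a sketch under stated assumptions: `OpState`, `ALLOWED`, `transition`, and `blocks_submissions` are hypothetical, as are the `approved` and `confirmed` states and the choice of which transitions are legal.

```python
from enum import Enum


class OpState(Enum):
    APPROVED = "approved"  # hypothetical starting state
    FAILED_BEFORE_SEND = "definitely_failed_before_send"
    REJECTED_BY_BANK = "rejected_by_bank"
    ACCEPTED_UNCONFIRMED = "accepted_unconfirmed"
    UNKNOWN_AFTER_SEND = "unknown_after_send"
    CONFIRMED = "confirmed"  # hypothetical terminal state


_RESOLVED = {OpState.CONFIRMED, OpState.REJECTED_BY_BANK}

# Assumed transition table: ambiguous states can only leave via reconciliation.
ALLOWED = {
    OpState.APPROVED: {
        OpState.FAILED_BEFORE_SEND,
        OpState.REJECTED_BY_BANK,
        OpState.ACCEPTED_UNCONFIRMED,
        OpState.UNKNOWN_AFTER_SEND,
        OpState.CONFIRMED,
    },
    # Nothing left the process, so a retry is safe.
    OpState.FAILED_BEFORE_SEND: {OpState.APPROVED},
    OpState.ACCEPTED_UNCONFIRMED: set(_RESOLVED),
    # Reconciliation may also prove the bank never received the request.
    OpState.UNKNOWN_AFTER_SEND: _RESOLVED | {OpState.FAILED_BEFORE_SEND},
    OpState.REJECTED_BY_BANK: set(),
    OpState.CONFIRMED: set(),
}

# States that should pause further submissions for the run.
BLOCKING = {OpState.ACCEPTED_UNCONFIRMED, OpState.UNKNOWN_AFTER_SEND}


def transition(current: OpState, new: OpState) -> OpState:
    """Return the new state, or raise if the move is not in the table."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new


def blocks_submissions(state: OpState) -> bool:
    return state in BLOCKING
```

For example, `transition(OpState.UNKNOWN_AFTER_SEND, OpState.APPROVED)` raises, which is the property the review asks for: an unknown outcome cannot be retried without first being reconciled.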

Align feedback with the user’s level of attention

needs changes · production blocker · 60/100

The code calibrates many feedback paths well: forged approvals and role failures post critical inbox alerts, mock/unconfirmed bank responses post warnings, and `_submit_one()` posts critical alerts for broad adapter exceptions. However, `reconcile_submitted_task()` catches only `BankAPIError`; `check_transfer_status()` can raise `NotImplementedError` in live mode, and any unexpected status-client exception escapes without a `task.reconcile_blocked` ledger event or durable inbox alert. Separately, `_submit_one()` treats all `BankAPIError` kinds, including `transient` and `other`, as `TaskStatus.FAILED` with retry guidance, even though those classes may represent an ambiguous post-send failure…

Recommendation

Use the same hard feedback primitive for reconciliation as for submission: catch all status-check boundary exceptions, append a typed ledger event such as `task.reconcile_blocked`, post a critical inbox alert, and keep the task in `SUBMITTED_UNKNOWN`. Split `BankAPIError` or replace it with a typed bank result so only definitely-not-sent failures are shown as retryable failures.
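A reconciliation wrapper that follows the recommendation above might be shaped like this. It is an illustrative sketch only: tasks, the ledger, and the inbox are modelled as plain dicts and lists, the status strings returned by `check_transfer_status` are assumed, and the `task.reconciled` event name is hypothetical (only `task.reconcile_blocked` appears in the review).

```python
from typing import Callable, Dict, List


def reconcile_submitted_task(task: Dict, check_transfer_status: Callable[[str], str],
                             ledger: List[Dict], inbox: List[Dict]) -> Dict:
    """Reconcile one task; any boundary failure leaves it SUBMITTED_UNKNOWN."""
    try:
        bank_status = check_transfer_status(task["task_id"])
    except Exception as exc:  # deliberately broad: includes NotImplementedError
        # Audit record first, then the operator alert, then keep the state.
        ledger.append({
            "event": "task.reconcile_blocked",
            "task_id": task["task_id"],
            "error_type": type(exc).__name__,
            "detail": str(exc),
        })
        inbox.append({
            "severity": "critical",
            "task_id": task["task_id"],
            "message": "Reconciliation blocked; bank exposure is still unknown.",
        })
        task["status"] = "SUBMITTED_UNKNOWN"
        return task

    # Assumed status vocabulary for the sketch: "confirmed" or "rejected".
    if bank_status == "confirmed":
        task["status"] = "SUCCEEDED"
    elif bank_status == "rejected":
        task["status"] = "FAILED"
    else:
        task["status"] = "SUBMITTED_UNKNOWN"
    ledger.append({"event": "task.reconciled", "task_id": task["task_id"],
                   "bank_status": bank_status})
    return task
```

The broad `except` is intentional at this boundary: the function never lets a status-check failure escape without both a ledger event and an inbox alert.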

Establish trust through inspectability

needs changes · production blocker · 60/100

Inspectability is supported by a real primitive: `RunLedger` writes hash-chained JSONL events with `prev_event_hash`, `event_hash`, `sequence_no`, and `verify_chain()`, and most approval/bank paths append typed events. The gap is that not every load-bearing bank recovery path is auditable: `reconcile_submitted_task()` lets non-`BankAPIError` exceptions from `check_transfer_status()` escape without a ledger event, and the run-level `submitted_amount_pence` payload in `execute_approved_payments()` is based on `run_total_submitted`, which omits newly created `SUBMITTED_UNKNOWN` tasks during that pass. Raw task events preserve some evidence, but the summarized audit projection can be misleading…

Recommendation

Append a typed audit event for every reconciliation boundary failure before returning or raising, and derive run-level submitted/exposure totals from the full task projection after each submission rather than from a counter updated only on `SUCCEEDED`. For production hardening, anchor or separate the ledger from the execution process, but the immediate blocker is the missing/misleading audit record for ambiguous bank exposure.
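For readers unfamiliar with the hash-chained ledger this finding relies on, the idea can be sketched in a few lines. This is an in-memory illustration, not the submission's JSONL-backed `RunLedger`: the field names `prev_event_hash`, `event_hash`, and `sequence_no` and the method `verify_chain()` come from the review, while the SHA-256 choice, the genesis value, and the canonical JSON encoding are assumptions.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # assumed placeholder hash for the first event


def _digest(body: Dict[str, Any]) -> str:
    # Canonical JSON so the same body always hashes to the same value.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RunLedger:
    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def append(self, event_type: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        body = {
            "sequence_no": len(self.events),
            "prev_event_hash": self.events[-1]["event_hash"] if self.events else GENESIS,
            "event_type": event_type,
            "payload": payload,
        }
        event = {**body, "event_hash": _digest(body)}
        self.events.append(event)
        return event

    def verify_chain(self) -> bool:
        prev = GENESIS
        for index, event in enumerate(self.events):
            body = {k: v for k, v in event.items() if k != "event_hash"}
            if event["sequence_no"] != index or event["prev_event_hash"] != prev:
                return False
            if _digest(body) != event["event_hash"]:
                return False
            prev = event["event_hash"]
        return True
```

A chain like this detects edits to recorded events, but it cannot reveal an event that was never written, which is why the missing `task.reconcile_blocked` record matters even when `verify_chain()` passes.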

Replace implied magic with clear mental models

needs changes · production blocker · 55/100

The code substantially improves the mental model by using explicit `BANK_MODE`, mock tagging, `confirmed=True` for real success, `TaskStatus.SIMULATED_SUCCEEDED`, and a visible `simulated` field in `inspect_run()`. But the mental model still breaks for ambiguous bank errors: `BankAPIError.kind` explicitly includes `transient` and `other`, yet `_submit_one()` records any `BankAPIError` as `TaskStatus.FAILED` and tells the operator to use `retry_failed_task()`. For a bank call that may have reached the bank before timing out, `FAILED` communicates false certainty; the user-relevant state is unknown/pending reconciliation. Delta: this improves the prior high-risk mock-versus-real-success issue,…

Recommendation

Separate definitely-not-sent failures such as `mode_unset` and `auth_missing` from possibly-sent failures such as network timeout, transient, and unknown adapter errors. Ambiguous outcomes should become `SUBMITTED_UNKNOWN` with reconciliation instructions, not `FAILED` with retry instructions.
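The split between definitely-not-sent and possibly-sent failures can be made explicit with an allow-list, so that anything not proven safe defaults to the unknown state. This is a sketch only: the `BankAPIError` constructor shape and the `status_for_submit_error` helper are assumptions, while the `kind` values are the ones named in the review.

```python
from enum import Enum
from typing import Tuple


class TaskStatus(Enum):
    FAILED = "FAILED"
    SUBMITTED_UNKNOWN = "SUBMITTED_UNKNOWN"


class BankAPIError(Exception):
    """Stand-in for the workflow's error type; constructor shape is assumed."""

    def __init__(self, kind: str, message: str = "") -> None:
        super().__init__(message or kind)
        self.kind = kind


# Kinds where the request provably never left the process.
DEFINITELY_NOT_SENT = frozenset({"mode_unset", "auth_missing"})


def status_for_submit_error(exc: Exception) -> Tuple[TaskStatus, str]:
    """Map a submit-time exception to a task status and operator guidance."""
    if isinstance(exc, BankAPIError) and exc.kind in DEFINITELY_NOT_SENT:
        return TaskStatus.FAILED, "Safe to retry: the request was never sent."
    # transient, other, timeouts, and unknown adapter errors all land here.
    return TaskStatus.SUBMITTED_UNKNOWN, "Reconcile with the bank before any retry."
```

Using an allow-list rather than a deny-list means a newly introduced error kind is treated as ambiguous until someone deliberately classifies it as safe.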

Optimise for steering, not only initiating

needs changes · production blocker · 45/100

The workflow has meaningful steering primitives: `pause_run()`, `resume_run()`, `cancel_run()`, `retry_failed_task()`, `resume_after_failure()`, and `reconcile_submitted_task()`, and `execute_approved_payments()` rereads durable steering intent from `ledger.latest_steering_intent()` between tasks. The remaining issue is that steering state is not always reconciled immediately: `cancel_run()` sets `cancellation_requested=True` and cancels cancellable tasks but does not set `run.status = CANCELLED`, so a run can remain top-level `AWAITING_APPROVAL` until another execution/reconciliation pass. More importantly, after a task enters `SUBMITTED_UNKNOWN`, the workflow should steer itself into a blo…

Recommendation

After cancellation or any ambiguous bank handoff, immediately reconcile the run projection into a user-actionable state: `CANCELLED` when cancellation has taken effect, or a blocked/awaiting-reconciliation state when bank exposure is unknown. Do not require an additional execute pass to make top-level steering state truthful.
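One way to keep the top-level state truthful is to derive it from the task statuses every time a steering call returns, rather than waiting for the next execute pass. The function below is a hypothetical sketch: `project_run_status`, the `APPROVED` task status, and the `AWAITING_RECONCILIATION` and `COMPLETED` run states are assumptions, and the precedence order is one possible design, not the submission's.

```python
from typing import Iterable

# Task statuses that could still lead to a new submission (APPROVED is assumed).
STILL_SUBMITTABLE = {"AWAITING_APPROVAL", "APPROVED"}


def project_run_status(task_statuses: Iterable[str], cancellation_requested: bool) -> str:
    """Derive the run's top-level status from its tasks and steering intent."""
    statuses = list(task_statuses)
    # Unknown bank exposure outranks everything else, including cancellation.
    if "SUBMITTED_UNKNOWN" in statuses:
        return "AWAITING_RECONCILIATION"  # hypothetical blocked state
    # Cancellation has taken effect once nothing can still be submitted.
    if cancellation_requested and not any(s in STILL_SUBMITTABLE for s in statuses):
        return "CANCELLED"
    if "AWAITING_APPROVAL" in statuses:
        return "AWAITING_APPROVAL"
    return "COMPLETED"  # hypothetical terminal state
```

Calling a projection like this at the end of `cancel_run()` would let the run report `CANCELLED` immediately, and would surface unknown exposure even on a run the operator has asked to cancel.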

Ensure that background work remains perceptible

aligned

Background work is made perceptible through persistent run/task state (`RunStatus`, `TaskStatus`), JSONL-backed `RunLedger`, `DurableOperatorInbox`, `inspect_run()` summaries, critical alerts, `audit_chain_intact`, and explicit `SUBMITTED_UNKNOWN` / reconciliation counts. `_submit_one()` now records `task.submission_blocked`, `task.submitted_unconfirmed`, or `task.submitted_mock` rather than leaving the operator without a visible state. Delta: this improves the prior high-risk P2 finding by making the bank-handoff state visible and durable.

Apply progressive disclosure to system agency

aligned

The default inspection surface in `inspect_run()` gives a compact primary view—`status`, `bank_mode`, `critical_alerts`, `summary.by_status`, real versus simulated success counts, and `needs_reconciliation`—while the detailed task list and `RunLedger.read_run_events()` provide deeper inspection when needed. This separates summary state from diagnostic audit detail without forcing users to parse the full ledger for every routine check. Delta: this maintains the prior aligned result.
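The two-level view described above can be illustrated with a compact summary that only attaches task detail on request. This is a sketch, not the submission's `inspect_run()`: the dict-based run and task shapes, the `detail` flag, and the `real_succeeded` and `simulated_succeeded` key names are assumptions; the other field names follow the review text.

```python
from collections import Counter
from typing import Any, Dict, List


def inspect_run(run: Dict[str, Any], tasks: List[Dict[str, Any]],
                detail: bool = False) -> Dict[str, Any]:
    """Return a compact primary view; add the task list only when asked."""
    by_status = Counter(t["status"] for t in tasks)
    view: Dict[str, Any] = {
        "status": run["status"],
        "bank_mode": run["bank_mode"],
        "critical_alerts": run.get("critical_alerts", 0),
        "summary": {
            "by_status": dict(by_status),
            # Real and simulated successes are never merged into one count.
            "real_succeeded": by_status["SUCCEEDED"],
            "simulated_succeeded": by_status["SIMULATED_SUCCEEDED"],
        },
        "needs_reconciliation": by_status["SUBMITTED_UNKNOWN"],
    }
    if detail:
        view["tasks"] = tasks  # deeper layer for diagnostic inspection
    return view
```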

Represent delegated work as a system, not merely as a conversation

aligned

Delegated work is represented as a structured system rather than a conversation: `PaymentRun` and `PaymentTask` records are keyed by `run_id`/`task_id`, policy decisions and invoice snapshots are stored per task, state transitions are represented in enums, the ledger provides a timeline, and `inspect_run()` exposes task summaries plus detailed task objects. The workflow separates approval, execution, retry, reconciliation, and inspection rather than relying on a message stream. Delta: this maintains the prior aligned result.

Add to your README

Two embeddable variants: a small badge and a richer card.

Score card (recommended)

[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/5c1d5833-5f29-462f-b1a8-e774498c40fb/card.svg)](https://aidesignblueprint.com/en/readiness-review/5c1d5833-5f29-462f-b1a8-e774498c40fb)

Flat badge

[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/5c1d5833-5f29-462f-b1a8-e774498c40fb.svg)](https://aidesignblueprint.com/en/readiness-review/5c1d5833-5f29-462f-b1a8-e774498c40fb)

Baseline and iteration details

Baseline: used · Doctrine: same doctrine · Race: checked clear

Iteration delta

Regressions (2)

P1 · Design for delegation rather than direct manipulation · aligned → high_risk

P10 · Optimise for steering, not only initiating · aligned → needs_changes

Improvements (4)

P2 · Ensure that background work remains perceptible · high_risk → aligned

P3 · Align feedback with the user’s level of attention · needs_changes → needs_changes

P5 · Replace implied magic with clear mental models · high_risk → needs_changes

P7 · Establish trust through inspectability · needs_changes → needs_changes

Rubric: 2026-05-04

Run your own AI Design Blueprint validation

Run ID: 5c1d5833-5f29-462f-b1a8-e774498c40fb · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.