Assessment complete; awaiting evidence revision.
Evaluated 10 May 2026 against the AI Design Blueprint doctrine
Status: High Risk
30/100
Grade F
The submission is clearly an autonomous payment workflow and includes several strong primitives: HMAC-bound approval envelopes, explicit run/task state, a durable inbox, a hash-chained ledger, explicit BANK_MODE handling, SUBMITTED_UNKNOWN, and separate SIMULATED_SUCCEEDED state. However, a load-bearing payment constraint still fails: newly created SUBMITTED_UNKNOWN bank handoffs are not counted against max_run_total_pence during the same execution pass, so the workflow can continue submitting additional payments even though the bank may already have accepted earlier ones. Reconciliation and bank-error semantics also remain incomplete for ambiguous transfer/status failures.
Iteration history
5 prior runs on this artifact. Each run_id opens its own readiness review.
Per-principle findings
10 principles evaluated. Verdict, severity, evidence and recommendation for each.
P0
Make hand-offs, approvals, and blockers explicit
Approvals are explicit and strong: `ApprovalEnvelope` is HMAC-bound to `run_id`, `task_id`, `policy_hash`, `invoice_snapshot_hash`, approver identity/role, and expiry, and `_submit_one()` refuses mock or unconfirmed responses as real success. The critical handoff failure is after the bank boundary: when `_submit_one()` returns `SUBMITTED_UNKNOWN`, execution does not stop and `run_total_submitted` is not incremented, so later approved tasks can still be submitted even though an earlier transfer may already be accepted by the bank. Also, `BankAPIError('transient'|'other')` is treated as `FAILED` rather than an explicit unknown-after-send blocker, and `reconcile_submitted_task()` lacks a broad…
Recommendation
Introduce a hard state break at the bank boundary: after any ambiguous submit result, record the task as `SUBMITTED_UNKNOWN`, reserve/count its amount against the run cap, append a blocker event, notify the inbox, and stop further submissions until reconciliation. Move both transfer and status checks behind a typed bank service boundary whose result envelope distinguishes `not_sent`, `rejected`, `accepted_unconfirmed`, `confirmed`, and `unknown_after_send`.
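One way to picture the recommended state break is the sketch below. It is illustrative only: `BankOutcome`, `BankResult`, `Task`, `run_submissions`, and the event name `run.blocked_pending_reconciliation` are hypothetical names, not taken from the submission, and the inbox notification is reduced to an appended event.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class BankOutcome(Enum):
    """Hypothetical typed result of a bank handoff (names mirror the recommendation)."""
    NOT_SENT = "not_sent"
    REJECTED = "rejected"
    ACCEPTED_UNCONFIRMED = "accepted_unconfirmed"
    CONFIRMED = "confirmed"
    UNKNOWN_AFTER_SEND = "unknown_after_send"


# Outcomes where the bank may hold the instruction but success is unproven.
AMBIGUOUS = frozenset({BankOutcome.ACCEPTED_UNCONFIRMED, BankOutcome.UNKNOWN_AFTER_SEND})


@dataclass(frozen=True)
class BankResult:
    outcome: BankOutcome
    detail: str = ""


@dataclass
class Task:
    task_id: str
    amount_pence: int
    status: str = "APPROVED"


def run_submissions(
    tasks: List[Task],
    submit: Callable[[Task], BankResult],
    events: List[Tuple[str, str]],
) -> int:
    """Submit tasks in order and stop at the first ambiguous result.

    Returns the worst-case exposure in pence for this pass.
    """
    exposure = 0
    for task in tasks:
        result = submit(task)
        if result.outcome is BankOutcome.CONFIRMED:
            task.status = "SUCCEEDED"
            exposure += task.amount_pence
        elif result.outcome in AMBIGUOUS:
            task.status = "SUBMITTED_UNKNOWN"
            exposure += task.amount_pence  # reserve against the run cap
            events.append(("run.blocked_pending_reconciliation", task.task_id))
            break  # hard stop: no further submissions until reconciled
        else:
            task.status = "FAILED"  # not_sent or rejected: no money at risk
    return exposure
```

In this sketch, tasks after the ambiguous one keep their prior status, so a later pass can pick them up once reconciliation clears the blocker.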
P0
Design for delegation rather than direct manipulation
The workflow supports delegation through `Policy`, `create_run()`, `draft_tasks()`, signed `ApprovalEnvelope`, and `execute_approved_payments()`, but the delegated constraint `Policy.max_run_total_pence` is not reliably enforced once a bank handoff becomes uncertain. `execute_approved_payments()` initializes `run_total_submitted` from existing `SUBMITTED`, `SUBMITTED_UNKNOWN`, and `SUCCEEDED` tasks, but after `_submit_one()` it increments the total only when `task.status == TaskStatus.SUCCEEDED`; a newly created `SUBMITTED_UNKNOWN` task is not counted before the loop proceeds to the next approved invoice. Because `SUBMITTED_UNKNOWN` explicitly means the bank may have accepted the request, th…
Recommendation
Treat every current-pass bank handoff that may have reached the bank—`SUBMITTED_UNKNOWN`, unconfirmed non-mock responses, and ambiguous transfer errors—as consuming the run budget immediately, and preferably pause/stop the execution loop until reconciliation. Architecturally, move the bank handoff behind a typed service result such as `not_sent`, `accepted_unconfirmed`, `confirmed`, `rejected`, or `unknown_after_send`, and enforce `max_run_total_pence` against the worst-case exposure, not only confirmed success.
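A minimal sketch of cap enforcement against worst-case exposure follows. The status names come from the findings above, but the `TaskStatus` enum here is a stand-in, and `worst_case_exposure_pence` and `may_submit` are hypothetical helpers. The point is that the total is recomputed from the task projection before each submission instead of being carried in a counter that only moves on `SUCCEEDED`.

```python
from enum import Enum
from typing import Iterable, Tuple


class TaskStatus(Enum):
    """Stand-in for the submission's enum; only the members used here are listed."""
    APPROVED = "APPROVED"
    SUBMITTED = "SUBMITTED"
    SUBMITTED_UNKNOWN = "SUBMITTED_UNKNOWN"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Statuses in which the bank may already hold the money.
EXPOSED = frozenset(
    {TaskStatus.SUBMITTED, TaskStatus.SUBMITTED_UNKNOWN, TaskStatus.SUCCEEDED}
)


def worst_case_exposure_pence(tasks: Iterable[Tuple[TaskStatus, int]]) -> int:
    """Sum the amounts of every task that may have reached the bank."""
    return sum(amount for status, amount in tasks if status in EXPOSED)


def may_submit(
    tasks: Iterable[Tuple[TaskStatus, int]],
    next_amount_pence: int,
    max_run_total_pence: int,
) -> bool:
    """Allow the next submission only if worst-case exposure stays within the cap."""
    return worst_case_exposure_pence(tasks) + next_amount_pence <= max_run_total_pence
```

Calling `may_submit` with the full task list before every handoff means a `SUBMITTED_UNKNOWN` created earlier in the same pass is counted automatically.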
P0
Expose meaningful operational state, not internal complexity
The state model is mostly user-meaningful: `AWAITING_APPROVAL`, `PAUSED`, `SUBMITTED_UNKNOWN`, `SUCCEEDED`, `SIMULATED_SUCCEEDED`, `FAILED`, and `CANCELLED` are explicit, and `inspect_run()` exposes counts and task-level status. The remaining state gap is load-bearing: `_submit_one()` maps all `BankAPIError` values to `FAILED`, despite the declared `transient`/`other` categories being operationally different from a definite rejection. In addition, `execute_approved_payments()` does not treat a newly produced `SUBMITTED_UNKNOWN` as submitted exposure for the run-total projection, so the run’s operational state can understate money at risk during the same pass. Delta: this improves the prior P…
Recommendation
Make the operational state machine distinguish `definitely_failed_before_send`, `rejected_by_bank`, `accepted_unconfirmed`, and `unknown_after_send`. Update the execution projection immediately after each task so `SUBMITTED_UNKNOWN` contributes to exposure and blocks or pauses further submissions until reconciliation.
P0
Align feedback with the user’s level of attention
The code calibrates many feedback paths well: forged approvals and role failures post critical inbox alerts, mock/unconfirmed bank responses post warnings, and `_submit_one()` posts critical alerts for broad adapter exceptions. However, `reconcile_submitted_task()` catches only `BankAPIError`; `check_transfer_status()` can raise `NotImplementedError` in live mode, and any unexpected status-client exception escapes without a `task.reconcile_blocked` ledger event or durable inbox alert. Separately, `_submit_one()` treats all `BankAPIError` kinds, including `transient` and `other`, as `TaskStatus.FAILED` with retry guidance, even though those classes may represent an ambiguous post-send failure…
Recommendation
Use the same hard feedback primitive for reconciliation as for submission: catch all status-check boundary exceptions, append a typed ledger event such as `task.reconcile_blocked`, post a critical inbox alert, and keep the task in `SUBMITTED_UNKNOWN`. Split `BankAPIError` or replace it with a typed bank result so only definitely-not-sent failures are shown as retryable failures.
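The reconciliation boundary described above could look roughly like this. It is a sketch under assumptions: `reconcile_safely` is a hypothetical wrapper, the ledger and inbox are modelled as plain lists, the status client is any callable returning a bank-side status string, and the `task.reconciled` event name is invented for illustration (`task.reconcile_blocked` is the name suggested in the recommendation).

```python
from typing import Callable, List, Tuple

# Hypothetical mapping from a bank-side status string to a task status.
BANK_TO_TASK_STATUS = {"confirmed": "SUCCEEDED", "rejected": "FAILED"}


def reconcile_safely(
    task_id: str,
    check_status: Callable[[str], str],
    ledger: List[Tuple[str, str, str]],
    inbox: List[Tuple[str, str, str]],
) -> str:
    """Return the task status after a reconciliation attempt.

    Any exception at the status-check boundary is recorded and alerted,
    and the task stays in SUBMITTED_UNKNOWN.
    """
    try:
        bank_status = check_status(task_id)
    except Exception as exc:  # deliberately broad: nothing may escape unrecorded
        ledger.append(("task.reconcile_blocked", task_id, type(exc).__name__))
        inbox.append(("critical", task_id, f"Reconciliation blocked: {exc!r}"))
        return "SUBMITTED_UNKNOWN"
    # Unrecognised bank statuses also leave the task unresolved.
    new_status = BANK_TO_TASK_STATUS.get(bank_status, "SUBMITTED_UNKNOWN")
    ledger.append(("task.reconciled", task_id, new_status))
    return new_status
```

The broad `except Exception` is intentional here: the finding is that `NotImplementedError` and other unexpected status-client errors currently escape without a ledger event or alert.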
P0
Establish trust through inspectability
Inspectability is supported by a real primitive: `RunLedger` writes hash-chained JSONL events with `prev_event_hash`, `event_hash`, `sequence_no`, and `verify_chain()`, and most approval/bank paths append typed events. The gap is that not every load-bearing bank recovery path is auditable: `reconcile_submitted_task()` lets non-`BankAPIError` exceptions from `check_transfer_status()` escape without a ledger event, and the run-level `submitted_amount_pence` payload in `execute_approved_payments()` is based on `run_total_submitted`, which omits newly created `SUBMITTED_UNKNOWN` tasks during that pass. Raw task events preserve some evidence, but the summarized audit projection can be misleading…
Recommendation
Append a typed audit event for every reconciliation boundary failure before returning or raising, and derive run-level submitted/exposure totals from the full task projection after each submission rather than from a counter updated only on `SUCCEEDED`. For production hardening, anchor or separate the ledger from the execution process, but the immediate blocker is the missing/misleading audit record for ambiguous bank exposure.
P0
Replace implied magic with clear mental models
The code substantially improves the mental model by using explicit `BANK_MODE`, mock tagging, `confirmed=True` for real success, `TaskStatus.SIMULATED_SUCCEEDED`, and a visible `simulated` field in `inspect_run()`. But the mental model still breaks for ambiguous bank errors: `BankAPIError.kind` explicitly includes `transient` and `other`, yet `_submit_one()` records any `BankAPIError` as `TaskStatus.FAILED` and tells the operator to use `retry_failed_task()`. For a bank call that may have reached the bank before timing out, `FAILED` communicates false certainty; the user-relevant state is unknown/pending reconciliation. Delta: this improves the prior high-risk mock-versus-real-success issue,…
Recommendation
Separate definitely-not-sent failures such as `mode_unset` and `auth_missing` from possibly-sent failures such as network timeout, transient, and unknown adapter errors. Ambiguous outcomes should become `SUBMITTED_UNKNOWN` with reconciliation instructions, not `FAILED` with retry instructions.
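The split between definitely-not-sent and possibly-sent failures can be sketched as a small classifier. The `BankAPIError` class below is a stand-in that assumes only a `kind` attribute, and `status_for_bank_error` is a hypothetical helper. The kinds `mode_unset`, `auth_missing`, `transient`, and `other` are the ones named in this review. Unknown kinds default to the ambiguous branch, on the assumption that false certainty is the costlier mistake.

```python
class BankAPIError(Exception):
    """Stand-in for the submission's exception; only `kind` is assumed."""

    def __init__(self, kind: str, message: str = "") -> None:
        super().__init__(message or kind)
        self.kind = kind


# Failures raised before any request could have left the process.
DEFINITELY_NOT_SENT = frozenset({"mode_unset", "auth_missing"})


def status_for_bank_error(exc: BankAPIError) -> str:
    """Map a bank error to a task status without claiming false certainty."""
    if exc.kind in DEFINITELY_NOT_SENT:
        return "FAILED"  # safe to offer retry guidance
    # transient, other, and any unrecognised kind may have reached the bank
    return "SUBMITTED_UNKNOWN"
```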
P0
Optimise for steering, not only initiating
The workflow has meaningful steering primitives: `pause_run()`, `resume_run()`, `cancel_run()`, `retry_failed_task()`, `resume_after_failure()`, and `reconcile_submitted_task()`, and `execute_approved_payments()` rereads durable steering intent from `ledger.latest_steering_intent()` between tasks. The remaining issue is that steering state is not always reconciled immediately: `cancel_run()` sets `cancellation_requested=True` and cancels cancellable tasks but does not set `run.status = CANCELLED`, so a run can remain top-level `AWAITING_APPROVAL` until another execution/reconciliation pass. More importantly, after a task enters `SUBMITTED_UNKNOWN`, the workflow should steer itself into a blo…
Recommendation
After cancellation or any ambiguous bank handoff, immediately reconcile the run projection into a user-actionable state: `CANCELLED` when cancellation has taken effect, or a blocked/awaiting-reconciliation state when bank exposure is unknown. Do not require an additional execute pass to make top-level steering state truthful.
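A sketch of an immediate run-level projection follows, assuming task statuses are available as strings. `project_run_status` and the `BLOCKED_AWAITING_RECONCILIATION` status are hypothetical; the submission may prefer a different name or reuse `PAUSED`. Unknown bank exposure is given priority over cancellation here, which is a design choice and not something the review prescribes.

```python
from typing import Iterable

# Task statuses that still represent pending or in-flight work.
ACTIVE = frozenset({"AWAITING_APPROVAL", "APPROVED", "SUBMITTED"})


def project_run_status(
    current_status: str,
    cancellation_requested: bool,
    task_statuses: Iterable[str],
) -> str:
    """Derive a truthful top-level run status from the current task projection."""
    statuses = list(task_statuses)
    if "SUBMITTED_UNKNOWN" in statuses:
        # Unknown exposure blocks everything else, including cancellation.
        return "BLOCKED_AWAITING_RECONCILIATION"
    if cancellation_requested and not any(s in ACTIVE for s in statuses):
        return "CANCELLED"  # cancellation has fully taken effect
    return current_status
```

Calling a function like this at the end of `cancel_run()` and after each bank handoff would keep the top-level status truthful without waiting for another execution pass.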
P0
Ensure that background work remains perceptible
Background work is made perceptible through persistent run/task state (`RunStatus`, `TaskStatus`), JSONL-backed `RunLedger`, `DurableOperatorInbox`, `inspect_run()` summaries, critical alerts, `audit_chain_intact`, and explicit `SUBMITTED_UNKNOWN` / reconciliation counts. `_submit_one()` now records `task.submission_blocked`, `task.submitted_unconfirmed`, or `task.submitted_mock` rather than leaving the operator without a visible state. Delta: this improves the prior high-risk P2 finding by making the bank-handoff state visible and durable.
P0
Apply progressive disclosure to system agency
The default inspection surface in `inspect_run()` gives a compact primary view—`status`, `bank_mode`, `critical_alerts`, `summary.by_status`, real versus simulated success counts, and `needs_reconciliation`—while the detailed task list and `RunLedger.read_run_events()` provide deeper inspection when needed. This separates summary state from diagnostic audit detail without forcing users to parse the full ledger for every routine check. Delta: this maintains the prior aligned result.
P0
Represent delegated work as a system, not merely as a conversation
Delegated work is represented as a structured system rather than a conversation: `PaymentRun` and `PaymentTask` records are keyed by `run_id`/`task_id`, policy decisions and invoice snapshots are stored per task, state transitions are represented in enums, the ledger provides a timeline, and `inspect_run()` exposes task summaries plus detailed task objects. The workflow separates approval, execution, retry, reconciliation, and inspection rather than relying on a message stream. Delta: this maintains the prior aligned result.
Iteration delta
Regressions (2)
Improvements (4)
Run ID: 5c1d5833-5f29-462f-b1a8-e774498c40fb · Results expire after 90 days
Run by agents. Governed by humans. Validated by the AI Design Blueprint.