Assessment complete; awaiting evidence revision.
Evaluated 10 May 2026 against the AI Design Blueprint doctrine
Status: High Risk
30/100
Grade F
Blueprint Readiness measures doctrine alignment, not runtime correctness. A production-ready verdict means the architecture embodies the 10 principles; it does not run your tests or check your types. Layer it on top of your test suite, not in place of it.
The submission is clearly an autonomous payment workflow and includes several strong primitives: HMAC-bound approval envelopes, explicit run/task state, a durable inbox, a hash-chained ledger, explicit BANK_MODE handling, SUBMITTED_UNKNOWN, and separate SIMULATED_SUCCEEDED state. However, a load-bearing payment constraint still fails: newly created SUBMITTED_UNKNOWN bank handoffs are not counted against max_run_total_pence during the same execution pass, so the workflow can continue submitting additional payments even though the bank may already have accepted earlier ones. Reconciliation and bank-error semantics also remain incomplete for ambiguous transfer/status failures.
Iteration history
5 prior runs on this artifact. Each run_id opens its own readiness review.
Scores can move up or down between iterations: the validator's reasoning is not strictly deterministic, so the same artifact can score differently across runs. The per-principle deltas below show the substantive change.
Per-principle findings
10 principles evaluated. Verdict, severity, evidence and recommendation for each.
P8
high risk · production blocker · 92/100 · Make hand-offs, approvals, and blockers explicit
Approvals are explicit and strong: `ApprovalEnvelope` is HMAC-bound to `run_id`, `task_id`, `policy_hash`, `invoice_snapshot_hash`, approver identity/role, and expiry, and `_submit_one()` refuses mock or unconfirmed responses as real success. The critical handoff failure is after the bank boundary: when `_submit_one()` returns `SUBMITTED_UNKNOWN`, execution does not stop and `run_total_submitted` is not incremented, so later approved tasks can still be submitted even though an earlier transfer may already be accepted by the bank. Also, `BankAPIError('transient'|'other')` is treated as `FAILED` rather than an explicit unknown-after-send blocker, and `reconcile_submitted_task()` lacks a broad…
Recommendation
Introduce a hard state break at the bank boundary: after any ambiguous submit result, record the task as `SUBMITTED_UNKNOWN`, reserve/count its amount against the run cap, append a blocker event, notify the inbox, and stop further submissions until reconciliation. Move both transfer and status checks behind a typed bank service boundary whose result envelope distinguishes `not_sent`, `rejected`, `accepted_unconfirmed`, `confirmed`, and `unknown_after_send`.
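The typed result envelope recommended above could look roughly like the following. This is a minimal sketch, not the submission's code: `BankOutcome`, `BankResult`, and `must_halt_run` are hypothetical names introduced for illustration, and only the five outcome values come from the recommendation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BankOutcome(Enum):
    # The five outcome names are taken from the recommendation above.
    NOT_SENT = "not_sent"
    REJECTED = "rejected"
    ACCEPTED_UNCONFIRMED = "accepted_unconfirmed"
    CONFIRMED = "confirmed"
    UNKNOWN_AFTER_SEND = "unknown_after_send"


# Outcomes where the bank may already hold the payment without the
# workflow having proof either way.
AMBIGUOUS = frozenset(
    {BankOutcome.ACCEPTED_UNCONFIRMED, BankOutcome.UNKNOWN_AFTER_SEND}
)


@dataclass(frozen=True)
class BankResult:
    """Hypothetical envelope returned by the bank service boundary."""

    outcome: BankOutcome
    amount_pence: int
    bank_reference: Optional[str] = None
    detail: str = ""


def must_halt_run(result: BankResult) -> bool:
    """Return True when submissions should stop until reconciliation."""
    return result.outcome in AMBIGUOUS
```

A caller would check `must_halt_run(result)` immediately after each bank call and, when it returns `True`, record `SUBMITTED_UNKNOWN`, append the blocker event, notify the inbox, and leave the loop.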
P1
high risk · production blocker · 90/100 · Design for delegation rather than direct manipulation
The workflow supports delegation through `Policy`, `create_run()`, `draft_tasks()`, signed `ApprovalEnvelope`, and `execute_approved_payments()`, but the delegated constraint `Policy.max_run_total_pence` is not reliably enforced once a bank handoff becomes uncertain. `execute_approved_payments()` initializes `run_total_submitted` from existing `SUBMITTED`, `SUBMITTED_UNKNOWN`, and `SUCCEEDED` tasks, but after `_submit_one()` it increments the total only when `task.status == TaskStatus.SUCCEEDED`; a newly created `SUBMITTED_UNKNOWN` task is not counted before the loop proceeds to the next approved invoice. Because `SUBMITTED_UNKNOWN` explicitly means the bank may have accepted the request, th…
Recommendation
Treat every current-pass bank handoff that may have reached the bank—`SUBMITTED_UNKNOWN`, unconfirmed non-mock responses, and ambiguous transfer errors—as consuming the run budget immediately, and preferably pause/stop the execution loop until reconciliation. Architecturally, move the bank handoff behind a typed service result such as `not_sent`, `accepted_unconfirmed`, `confirmed`, `rejected`, or `unknown_after_send`, and enforce `max_run_total_pence` against the worst-case exposure, not only confirmed success.
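One way to enforce the cap against worst-case exposure is sketched below. It is an illustration under stated assumptions, not the submission's `execute_approved_payments()`: the `Task` record, the `APPROVED` status, and the `submit_one` callback are simplified stand-ins, and exposure is recomputed from the task list before every submission instead of being tracked in a counter.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class TaskStatus(Enum):
    APPROVED = "APPROVED"  # assumed name for "approved, not yet submitted"
    SUBMITTED = "SUBMITTED"
    SUBMITTED_UNKNOWN = "SUBMITTED_UNKNOWN"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Statuses that may represent money already accepted by the bank.
EXPOSED = frozenset(
    {TaskStatus.SUBMITTED, TaskStatus.SUBMITTED_UNKNOWN, TaskStatus.SUCCEEDED}
)


@dataclass
class Task:
    task_id: str
    amount_pence: int
    status: TaskStatus


def exposure_pence(tasks: List[Task]) -> int:
    """Worst-case amount the bank may have accepted for this run."""
    return sum(t.amount_pence for t in tasks if t.status in EXPOSED)


def execute_within_cap(
    tasks: List[Task],
    max_run_total_pence: int,
    submit_one: Callable[[Task], TaskStatus],
) -> List[str]:
    """Submit approved tasks until the cap or an ambiguous result stops the run."""
    attempted: List[str] = []
    for task in tasks:
        if task.status is not TaskStatus.APPROVED:
            continue
        # Recompute exposure from the task list so earlier results in this
        # same pass, including SUBMITTED_UNKNOWN, always count.
        if exposure_pence(tasks) + task.amount_pence > max_run_total_pence:
            break
        task.status = submit_one(task)
        attempted.append(task.task_id)
        if task.status is TaskStatus.SUBMITTED_UNKNOWN:
            break  # hard stop until reconciliation
    return attempted
```

Deriving exposure from task state on each iteration removes the class of bug described in the finding, where a counter is updated for `SUCCEEDED` only.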
P6
needs changes · production blocker · 65/100 · Expose meaningful operational state, not internal complexity
The state model is mostly user-meaningful: `AWAITING_APPROVAL`, `PAUSED`, `SUBMITTED_UNKNOWN`, `SUCCEEDED`, `SIMULATED_SUCCEEDED`, `FAILED`, and `CANCELLED` are explicit, and `inspect_run()` exposes counts and task-level status. The remaining state gap is load-bearing: `_submit_one()` maps all `BankAPIError` values to `FAILED`, despite the declared `transient`/`other` categories being operationally different from a definite rejection. In addition, `execute_approved_payments()` does not treat a newly produced `SUBMITTED_UNKNOWN` as submitted exposure for the run-total projection, so the run’s operational state can understate money at risk during the same pass. Delta: this improves the prior P…
Recommendation
Make the operational state machine distinguish `definitely_failed_before_send`, `rejected_by_bank`, `accepted_unconfirmed`, and `unknown_after_send`. Update the execution projection immediately after each task so `SUBMITTED_UNKNOWN` contributes to exposure and blocks or pauses further submissions until reconciliation.
P?
needs changes · production blocker · 60/100 · Align feedback with the user’s level of attention
The code calibrates many feedback paths well: forged approvals and role failures post critical inbox alerts, mock/unconfirmed bank responses post warnings, and `_submit_one()` posts critical alerts for broad adapter exceptions. However, `reconcile_submitted_task()` catches only `BankAPIError`; `check_transfer_status()` can raise `NotImplementedError` in live mode, and any unexpected status-client exception escapes without a `task.reconcile_blocked` ledger event or durable inbox alert. Separately, `_submit_one()` treats all `BankAPIError` kinds, including `transient` and `other`, as `TaskStatus.FAILED` with retry guidance, even though those classes may represent an ambiguous post-send failure…
Recommendation
Use the same hard feedback primitive for reconciliation as for submission: catch all status-check boundary exceptions, append a typed ledger event such as `task.reconcile_blocked`, post a critical inbox alert, and keep the task in `SUBMITTED_UNKNOWN`. Split `BankAPIError` or replace it with a typed bank result so only definitely-not-sent failures are shown as retryable failures.
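The reconciliation boundary described above might be shaped like this. It is a sketch under assumptions: the ledger and inbox are modelled as plain lists standing in for `RunLedger` and `DurableOperatorInbox`, the task is a plain dict, and the `"confirmed"` and `"rejected"` status strings are hypothetical return values of the status check.

```python
from typing import Callable, Dict, List


class BankAPIError(Exception):
    """Stand-in for the submission's bank error type."""


def reconcile_submitted_task(
    task: Dict[str, str],
    check_transfer_status: Callable[[str], str],
    ledger: List[dict],
    inbox: List[dict],
) -> str:
    """Reconcile one task; any status-check failure leaves it SUBMITTED_UNKNOWN."""
    task_id = task["task_id"]
    try:
        bank_status = check_transfer_status(task_id)
    except Exception as exc:  # covers BankAPIError, NotImplementedError, and others
        # Record the blocker durably before returning, so the failure is
        # both auditable and visible to the operator.
        ledger.append(
            {
                "event": "task.reconcile_blocked",
                "task_id": task_id,
                "error_type": type(exc).__name__,
                "error": str(exc),
            }
        )
        inbox.append(
            {
                "severity": "critical",
                "task_id": task_id,
                "message": "Reconciliation blocked; bank exposure still unknown.",
            }
        )
        return task["status"]

    if bank_status == "confirmed":
        task["status"] = "SUCCEEDED"
    elif bank_status == "rejected":
        task["status"] = "FAILED"
    # Any other answer leaves the task awaiting reconciliation.
    return task["status"]
```

Catching `Exception` at this one boundary is deliberate: the goal is that no status-check failure can escape without a ledger event and an inbox alert.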
P7
needs changes · production blocker · 60/100 · Establish trust through inspectability
Inspectability is supported by a real primitive: `RunLedger` writes hash-chained JSONL events with `prev_event_hash`, `event_hash`, `sequence_no`, and `verify_chain()`, and most approval/bank paths append typed events. The gap is that not every load-bearing bank recovery path is auditable: `reconcile_submitted_task()` lets non-`BankAPIError` exceptions from `check_transfer_status()` escape without a ledger event, and the run-level `submitted_amount_pence` payload in `execute_approved_payments()` is based on `run_total_submitted`, which omits newly created `SUBMITTED_UNKNOWN` tasks during that pass. Raw task events preserve some evidence, but the summarized audit projection can be misleading…
Recommendation
Append a typed audit event for every reconciliation boundary failure before returning or raising, and derive run-level submitted/exposure totals from the full task projection after each submission rather than from a counter updated only on `SUCCEEDED`. For production hardening, anchor or separate the ledger from the execution process, but the immediate blocker is the missing/misleading audit record for ambiguous bank exposure.
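For readers unfamiliar with the hash-chained ledger this finding relies on, the idea can be sketched as follows. This is not the submission's `RunLedger`: the field names `sequence_no`, `prev_event_hash`, and `event_hash` and the function name `verify_chain` come from the finding, while `append_event`, the all-zero genesis hash, and the choice of SHA-256 over canonical JSON are assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # assumed placeholder for the first event's predecessor


def _digest(body: Dict[str, Any]) -> str:
    """Hash a canonical JSON encoding so key order cannot change the result."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_event(
    chain: List[Dict[str, Any]], event_type: str, payload: Dict[str, Any]
) -> Dict[str, Any]:
    """Append an event whose hash covers its content and its predecessor's hash."""
    body = {
        "sequence_no": len(chain),
        "prev_event_hash": chain[-1]["event_hash"] if chain else GENESIS_HASH,
        "event_type": event_type,
        "payload": payload,
    }
    event = {**body, "event_hash": _digest(body)}
    chain.append(event)
    return event


def verify_chain(chain: List[Dict[str, Any]]) -> bool:
    """Return False if any event was edited, reordered, or removed mid-chain."""
    prev = GENESIS_HASH
    for index, event in enumerate(chain):
        body = {k: v for k, v in event.items() if k != "event_hash"}
        if event["sequence_no"] != index or event["prev_event_hash"] != prev:
            return False
        if _digest(body) != event["event_hash"]:
            return False
        prev = event["event_hash"]
    return True
```

A chain like this only proves integrity for events that were actually written, which is why the missing `task.reconcile_blocked` record matters: an unlogged failure leaves the chain intact but the audit trail incomplete.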
P5
needs changes · production blocker · 55/100 · Replace implied magic with clear mental models
The code substantially improves the mental model by using explicit `BANK_MODE`, mock tagging, `confirmed=True` for real success, `TaskStatus.SIMULATED_SUCCEEDED`, and a visible `simulated` field in `inspect_run()`. But the mental model still breaks for ambiguous bank errors: `BankAPIError.kind` explicitly includes `transient` and `other`, yet `_submit_one()` records any `BankAPIError` as `TaskStatus.FAILED` and tells the operator to use `retry_failed_task()`. For a bank call that may have reached the bank before timing out, `FAILED` communicates false certainty; the user-relevant state is unknown/pending reconciliation. Delta: this improves the prior high-risk mock-versus-real-success issue,…
Recommendation
Separate definitely-not-sent failures such as `mode_unset` and `auth_missing` from possibly-sent failures such as network timeout, transient, and unknown adapter errors. Ambiguous outcomes should become `SUBMITTED_UNKNOWN` with reconciliation instructions, not `FAILED` with retry instructions.
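The split between definitely-not-sent and possibly-sent failures can be expressed as a small mapping. The sketch below assumes `BankAPIError` carries a `kind` string, as the finding describes; the function name `status_for_bank_error` and the returned action labels are hypothetical, and the stand-in class is not the submission's definition.

```python
from typing import Tuple


class BankAPIError(Exception):
    """Stand-in for the submission's error type, which carries a `kind`."""

    def __init__(self, kind: str, message: str = "") -> None:
        super().__init__(message or kind)
        self.kind = kind


# Kinds where the request provably never left the process.
DEFINITELY_NOT_SENT = frozenset({"mode_unset", "auth_missing"})


def status_for_bank_error(err: BankAPIError) -> Tuple[str, str]:
    """Map an error to (task status, operator action).

    Anything not provably unsent is treated as possibly sent, so an
    unrecognised kind defaults to the cautious outcome.
    """
    if err.kind in DEFINITELY_NOT_SENT:
        return "FAILED", "retry_failed_task"
    return "SUBMITTED_UNKNOWN", "reconcile_submitted_task"
```

Using an allow-list for the safe-to-retry kinds means a new or misspelled error kind falls into reconciliation by default, which is the safer failure direction for payments.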
P10
needs changes · production blocker · 45/100 · Optimise for steering, not only initiating
The workflow has meaningful steering primitives: `pause_run()`, `resume_run()`, `cancel_run()`, `retry_failed_task()`, `resume_after_failure()`, and `reconcile_submitted_task()`, and `execute_approved_payments()` rereads durable steering intent from `ledger.latest_steering_intent()` between tasks. The remaining issue is that steering state is not always reconciled immediately: `cancel_run()` sets `cancellation_requested=True` and cancels cancellable tasks but does not set `run.status = CANCELLED`, so a run can remain top-level `AWAITING_APPROVAL` until another execution/reconciliation pass. More importantly, after a task enters `SUBMITTED_UNKNOWN`, the workflow should steer itself into a blo…
Recommendation
After cancellation or any ambiguous bank handoff, immediately reconcile the run projection into a user-actionable state: `CANCELLED` when cancellation has taken effect, or a blocked/awaiting-reconciliation state when bank exposure is unknown. Do not require an additional execute pass to make top-level steering state truthful.
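A run-status projection that stays truthful without an extra execute pass might look like the sketch below. The names are assumptions: `AWAITING_RECONCILIATION` is a hypothetical run status that the submission does not define, `project_run_status` is an illustrative helper, and treating `FAILED` as settled for cancellation purposes is a simplification.

```python
from typing import Iterable

# Task statuses with nothing further in flight (assumed set).
SETTLED = frozenset({"CANCELLED", "SUCCEEDED", "SIMULATED_SUCCEEDED", "FAILED"})


def project_run_status(
    current_status: str,
    cancellation_requested: bool,
    task_statuses: Iterable[str],
) -> str:
    """Derive the top-level run status from steering intent and task state."""
    statuses = list(task_statuses)
    # Unknown bank exposure outranks everything else, including cancellation:
    # the operator must reconcile before the run can be called finished.
    if "SUBMITTED_UNKNOWN" in statuses:
        return "AWAITING_RECONCILIATION"
    if cancellation_requested and all(s in SETTLED for s in statuses):
        return "CANCELLED"
    return current_status
```

Calling a projection like this at the end of `cancel_run()` and after every bank handoff would update the top-level status in the same step that changes the underlying task state.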
P2
aligned · Ensure that background work remains perceptible
Background work is made perceptible through persistent run/task state (`RunStatus`, `TaskStatus`), JSONL-backed `RunLedger`, `DurableOperatorInbox`, `inspect_run()` summaries, critical alerts, `audit_chain_intact`, and explicit `SUBMITTED_UNKNOWN` / reconciliation counts. `_submit_one()` now records `task.submission_blocked`, `task.submitted_unconfirmed`, or `task.submitted_mock` rather than leaving the operator without a visible state. Delta: this improves the prior high-risk P2 finding by making the bank-handoff state visible and durable.
P4
aligned · Apply progressive disclosure to system agency
The default inspection surface in `inspect_run()` gives a compact primary view—`status`, `bank_mode`, `critical_alerts`, `summary.by_status`, real versus simulated success counts, and `needs_reconciliation`—while the detailed task list and `RunLedger.read_run_events()` provide deeper inspection when needed. This separates summary state from diagnostic audit detail without forcing users to parse the full ledger for every routine check. Delta: this maintains the prior aligned result.
P9
aligned · Represent delegated work as a system, not merely as a conversation
Delegated work is represented as a structured system rather than a conversation: `PaymentRun` and `PaymentTask` records are keyed by `run_id`/`task_id`, policy decisions and invoice snapshots are stored per task, state transitions are represented in enums, the ledger provides a timeline, and `inspect_run()` exposes task summaries plus detailed task objects. The workflow separates approval, execution, retry, reconciliation, and inspection rather than relying on a message stream. Delta: this maintains the prior aligned result.
Iteration delta
Regressions (2)
Improvements (4)
Run ID: 5c1d5833-5f29-462f-b1a8-e774498c40fb · Results expire after 90 days
Run by agents. Governed by humans. Validated by the AI Design Blueprint.