Reviewed

Assessment complete; awaiting evidence revision.

Agent Architecture Review, Validation snapshot

Evaluated 10 May 2026 against the AI Design Blueprint doctrine

Status: High Risk

30/100

Grade F

3 aligned · 7 production blockers · 2 high risk

Blueprint Readiness measures doctrine alignment, not runtime correctness. A production-ready verdict means the architecture embodies the 10 principles; the review does not run your tests or type checks. Layer it on top of your test suite, not in place of it.

Per-principle verdicts

The submission is clearly an autonomous payment workflow and includes several strong primitives: HMAC-bound approval envelopes, explicit run/task state, a durable inbox, a hash-chained ledger, explicit BANK_MODE handling, SUBMITTED_UNKNOWN, and separate SIMULATED_SUCCEEDED state. However, a load-bearing payment constraint still fails: newly created SUBMITTED_UNKNOWN bank handoffs are not counted against max_run_total_pence during the same execution pass, so the workflow can continue submitting additional payments even though the bank may already have accepted earlier ones. Reconciliation and bank-error semantics also remain incomplete for ambiguous transfer/status failures.

Iteration history

5 prior runs on this artifact. Each run_id opens its own readiness review.

Scores can move up or down between iterations: the validator's reasoning is not strictly deterministic, so the same artifact can score differently across runs. The per-principle deltas below show the substantive change.

When	Score	Tier	Run ID
10 May 2026 (this run)	30 / F	Draft	5c1d5833…
10 May 2026	40 / D	Draft	b1195c34…
10 May 2026	74 / C	Emerging	dd3a9348…
10 May 2026	74 / C	Emerging	b4799966…
10 May 2026	74 / C	Emerging	7e9bc0f6…
10 May 2026	60 / C	Emerging	15aa9649…

Per-principle findings

10 principles evaluated. Verdict, severity, evidence and recommendation for each.

high risk · production blocker · 92/100

Make hand-offs, approvals, and blockers explicit

Approvals are explicit and strong: `ApprovalEnvelope` is HMAC-bound to `run_id`, `task_id`, `policy_hash`, `invoice_snapshot_hash`, approver identity/role, and expiry, and `_submit_one()` refuses mock or unconfirmed responses as real success. The critical handoff failure is after the bank boundary: when `_submit_one()` returns `SUBMITTED_UNKNOWN`, execution does not stop and `run_total_submitted` is not incremented, so later approved tasks can still be submitted even though an earlier transfer may already be accepted by the bank. Also, `BankAPIError('transient'|'other')` is treated as `FAILED` rather than an explicit unknown-after-send blocker, and `reconcile_submitted_task()` lacks a broad…

Recommendation

Introduce a hard state break at the bank boundary: after any ambiguous submit result, record the task as `SUBMITTED_UNKNOWN`, reserve/count its amount against the run cap, append a blocker event, notify the inbox, and stop further submissions until reconciliation. Move both transfer and status checks behind a typed bank service boundary whose result envelope distinguishes `not_sent`, `rejected`, `accepted_unconfirmed`, `confirmed`, and `unknown_after_send`.
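The hard state break recommended above can be sketched as follows. This is a minimal illustration, not the submission's code: `BankOutcome`, `RunContext`, `apply_bank_result`, `run_pass`, and the event name `run.blocked_pending_reconciliation` are hypothetical, and plain lists stand in for `RunLedger` and `DurableOperatorInbox`. Run-cap accounting is left out to keep the focus on the halt.

```python
from dataclasses import dataclass, field
from enum import Enum


class BankOutcome(Enum):
    NOT_SENT = "not_sent"
    REJECTED = "rejected"
    ACCEPTED_UNCONFIRMED = "accepted_unconfirmed"
    CONFIRMED = "confirmed"
    UNKNOWN_AFTER_SEND = "unknown_after_send"


# Outcomes where the bank may hold the transfer but nothing confirms it yet.
AMBIGUOUS = frozenset(
    {BankOutcome.ACCEPTED_UNCONFIRMED, BankOutcome.UNKNOWN_AFTER_SEND}
)


@dataclass
class Task:  # hypothetical stand-in for PaymentTask
    task_id: str
    amount_pence: int
    status: str = "APPROVED"


@dataclass
class RunContext:  # hypothetical; lists stand in for the ledger and inbox
    ledger: list = field(default_factory=list)
    inbox: list = field(default_factory=list)
    halted: bool = False


def apply_bank_result(ctx: RunContext, task: Task, outcome: BankOutcome) -> None:
    """Record the outcome and halt the run if the handoff is ambiguous."""
    if outcome in AMBIGUOUS:
        task.status = "SUBMITTED_UNKNOWN"
        ctx.ledger.append({"type": "run.blocked_pending_reconciliation",
                           "task_id": task.task_id, "outcome": outcome.value})
        ctx.inbox.append({"severity": "critical", "task_id": task.task_id})
        ctx.halted = True
    elif outcome is BankOutcome.CONFIRMED:
        task.status = "SUCCEEDED"
    else:  # NOT_SENT or REJECTED: the bank definitely holds no transfer
        task.status = "FAILED"


def run_pass(ctx: RunContext, tasks, submit) -> None:
    """Submit tasks in order, stopping at the first ambiguous handoff."""
    for task in tasks:
        if ctx.halted:
            break  # hard state break: nothing more is sent until reconciliation
        apply_bank_result(ctx, task, submit(task))
```

With three approved tasks where the second returns `unknown_after_send`, the third stays `APPROVED` and is never sent.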

high risk · production blocker · 90/100

Design for delegation rather than direct manipulation

The workflow supports delegation through `Policy`, `create_run()`, `draft_tasks()`, signed `ApprovalEnvelope`, and `execute_approved_payments()`, but the delegated constraint `Policy.max_run_total_pence` is not reliably enforced once a bank handoff becomes uncertain. `execute_approved_payments()` initializes `run_total_submitted` from existing `SUBMITTED`, `SUBMITTED_UNKNOWN`, and `SUCCEEDED` tasks, but after `_submit_one()` it increments the total only when `task.status == TaskStatus.SUCCEEDED`; a newly created `SUBMITTED_UNKNOWN` task is not counted before the loop proceeds to the next approved invoice. Because `SUBMITTED_UNKNOWN` explicitly means the bank may have accepted the request, th…

Recommendation

Treat every current-pass bank handoff that may have reached the bank—`SUBMITTED_UNKNOWN`, unconfirmed non-mock responses, and ambiguous transfer errors—as consuming the run budget immediately, and preferably pause/stop the execution loop until reconciliation. Architecturally, move the bank handoff behind a typed service result such as `not_sent`, `accepted_unconfirmed`, `confirmed`, `rejected`, or `unknown_after_send`, and enforce `max_run_total_pence` against the worst-case exposure, not only confirmed success.
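One way to enforce the cap against worst-case exposure is to recompute exposure from task state before every submission instead of keeping a counter that moves only on confirmed success. The sketch below assumes a simplified `Task` record and hypothetical helpers `worst_case_exposure` and `can_submit`; only the status names and `max_run_total_pence` come from the review.

```python
from dataclasses import dataclass

# Statuses under which money may already have left the account.
EXPOSED = frozenset({"SUBMITTED", "SUBMITTED_UNKNOWN", "SUCCEEDED"})


@dataclass
class Task:  # hypothetical stand-in for PaymentTask
    task_id: str
    amount_pence: int
    status: str


def worst_case_exposure(tasks) -> int:
    """Sum every amount the bank may have accepted, confirmed or not."""
    return sum(t.amount_pence for t in tasks if t.status in EXPOSED)


def can_submit(tasks, candidate: Task, max_run_total_pence: int) -> bool:
    """Allow a new submission only if the cap holds in the worst case."""
    projected = worst_case_exposure(tasks) + candidate.amount_pence
    return projected <= max_run_total_pence
```

Because the total is derived from the tasks themselves, a task that turns `SUBMITTED_UNKNOWN` mid-pass counts against the cap on the very next check.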

needs changes · production blocker · 65/100

Expose meaningful operational state, not internal complexity

The state model is mostly user-meaningful: `AWAITING_APPROVAL`, `PAUSED`, `SUBMITTED_UNKNOWN`, `SUCCEEDED`, `SIMULATED_SUCCEEDED`, `FAILED`, and `CANCELLED` are explicit, and `inspect_run()` exposes counts and task-level status. The remaining state gap is load-bearing: `_submit_one()` maps all `BankAPIError` values to `FAILED`, despite the declared `transient`/`other` categories being operationally different from a definite rejection. In addition, `execute_approved_payments()` does not treat a newly produced `SUBMITTED_UNKNOWN` as submitted exposure for the run-total projection, so the run’s operational state can understate money at risk during the same pass. Delta: this improves the prior P…

Recommendation

Make the operational state machine distinguish `definitely_failed_before_send`, `rejected_by_bank`, `accepted_unconfirmed`, and `unknown_after_send`. Update the execution projection immediately after each task so `SUBMITTED_UNKNOWN` contributes to exposure and blocks or pauses further submissions until reconciliation.

needs changes · production blocker · 60/100

Align feedback with the user’s level of attention

The code calibrates many feedback paths well: forged approvals and role failures post critical inbox alerts, mock/unconfirmed bank responses post warnings, and `_submit_one()` posts critical alerts for broad adapter exceptions. However, `reconcile_submitted_task()` catches only `BankAPIError`; `check_transfer_status()` can raise `NotImplementedError` in live mode, and any unexpected status-client exception escapes without a `task.reconcile_blocked` ledger event or durable inbox alert. Separately, `_submit_one()` treats all `BankAPIError` kinds, including `transient` and `other`, as `TaskStatus.FAILED` with retry guidance, even though those classes may represent an ambiguous post-send failure…

Recommendation

Use the same hard feedback primitive for reconciliation as for submission: catch all status-check boundary exceptions, append a typed ledger event such as `task.reconcile_blocked`, post a critical inbox alert, and keep the task in `SUBMITTED_UNKNOWN`. Split `BankAPIError` or replace it with a typed bank result so only definitely-not-sent failures are shown as retryable failures.
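A reconciliation wrapper along these lines would keep every status-check failure visible. It is a sketch under assumptions: `reconcile_safely` is a hypothetical name, `check_status` is assumed to return a plain string such as `"confirmed"` or `"rejected"` (the real return type of `check_transfer_status()` is not shown in the review), and lists stand in for the ledger and inbox.

```python
from dataclasses import dataclass


@dataclass
class Task:  # hypothetical stand-in for PaymentTask
    task_id: str
    status: str = "SUBMITTED_UNKNOWN"


def reconcile_safely(task: Task, check_status, ledger: list, inbox: list) -> bool:
    """Return True if the task was resolved, False if it is still unknown."""
    try:
        bank_state = check_status(task.task_id)
    except Exception as exc:  # deliberately broad: this is the bank boundary
        ledger.append({"type": "task.reconcile_blocked",
                       "task_id": task.task_id,
                       "error": type(exc).__name__,
                       "message": str(exc)})
        inbox.append({"severity": "critical", "task_id": task.task_id})
        return False  # task.status stays SUBMITTED_UNKNOWN
    if bank_state == "confirmed":
        task.status = "SUCCEEDED"
    elif bank_state == "rejected":
        task.status = "FAILED"
    # Any other answer leaves the task in SUBMITTED_UNKNOWN.
    return task.status != "SUBMITTED_UNKNOWN"
```

Catching `Exception` rather than only `BankAPIError` means a `NotImplementedError` from a missing live status client produces a ledger event and an alert instead of escaping silently.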

needs changes · production blocker · 60/100

Establish trust through inspectability

Inspectability is supported by a real primitive: `RunLedger` writes hash-chained JSONL events with `prev_event_hash`, `event_hash`, `sequence_no`, and `verify_chain()`, and most approval/bank paths append typed events. The gap is that not every load-bearing bank recovery path is auditable: `reconcile_submitted_task()` lets non-`BankAPIError` exceptions from `check_transfer_status()` escape without a ledger event, and the run-level `submitted_amount_pence` payload in `execute_approved_payments()` is based on `run_total_submitted`, which omits newly created `SUBMITTED_UNKNOWN` tasks during that pass. Raw task events preserve some evidence, but the summarized audit projection can be misleading…

Recommendation

Append a typed audit event for every reconciliation boundary failure before returning or raising, and derive run-level submitted/exposure totals from the full task projection after each submission rather than from a counter updated only on `SUCCEEDED`. For production hardening, anchor or separate the ledger from the execution process, but the immediate blocker is the missing/misleading audit record for ambiguous bank exposure.

needs changes · production blocker · 55/100

Replace implied magic with clear mental models

The code substantially improves the mental model by using explicit `BANK_MODE`, mock tagging, `confirmed=True` for real success, `TaskStatus.SIMULATED_SUCCEEDED`, and a visible `simulated` field in `inspect_run()`. But the mental model still breaks for ambiguous bank errors: `BankAPIError.kind` explicitly includes `transient` and `other`, yet `_submit_one()` records any `BankAPIError` as `TaskStatus.FAILED` and tells the operator to use `retry_failed_task()`. For a bank call that may have reached the bank before timing out, `FAILED` communicates false certainty; the user-relevant state is unknown/pending reconciliation. Delta: this improves the prior high-risk mock-versus-real-success issue,…

Recommendation

Separate definitely-not-sent failures such as `mode_unset` and `auth_missing` from possibly-sent failures such as network timeout, transient, and unknown adapter errors. Ambiguous outcomes should become `SUBMITTED_UNKNOWN` with reconciliation instructions, not `FAILED` with retry instructions.
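The split between definitely-not-sent and possibly-sent failures can be expressed as a small classifier. In this sketch `BankAPIError` is redefined locally as a stand-in with only a `kind` attribute, and `status_for_error` and `NOT_SENT_KINDS` are hypothetical names; the kind strings are the ones named in the review.

```python
class BankAPIError(Exception):
    """Local stand-in for the submission's error type (assumed shape)."""

    def __init__(self, kind: str, message: str = ""):
        super().__init__(message or kind)
        self.kind = kind


# Failures raised before any request could have left the process.
NOT_SENT_KINDS = frozenset({"mode_unset", "auth_missing"})


def status_for_error(exc: Exception) -> str:
    """Map a submission error to a task status.

    Only errors known to occur before sending are safe to retry; everything
    else is treated as a possible handoff that needs reconciliation.
    """
    if isinstance(exc, BankAPIError) and exc.kind in NOT_SENT_KINDS:
        return "FAILED"
    return "SUBMITTED_UNKNOWN"
```

The default branch is the conservative one: an unrecognised error kind falls into `SUBMITTED_UNKNOWN`, so adding a new adapter error cannot silently make a possibly-sent payment retryable.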

P10 · needs changes · production blocker · 45/100

Optimise for steering, not only initiating

The workflow has meaningful steering primitives: `pause_run()`, `resume_run()`, `cancel_run()`, `retry_failed_task()`, `resume_after_failure()`, and `reconcile_submitted_task()`, and `execute_approved_payments()` rereads durable steering intent from `ledger.latest_steering_intent()` between tasks. The remaining issue is that steering state is not always reconciled immediately: `cancel_run()` sets `cancellation_requested=True` and cancels cancellable tasks but does not set `run.status = CANCELLED`, so a run can remain top-level `AWAITING_APPROVAL` until another execution/reconciliation pass. More importantly, after a task enters `SUBMITTED_UNKNOWN`, the workflow should steer itself into a blo…

Recommendation

After cancellation or any ambiguous bank handoff, immediately reconcile the run projection into a user-actionable state: `CANCELLED` when cancellation has taken effect, or a blocked/awaiting-reconciliation state when bank exposure is unknown. Do not require an additional execute pass to make top-level steering state truthful.
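A pure projection function, called at the end of `cancel_run()` and after each submission, is one way to keep the top-level status truthful without waiting for another execute pass. The sketch is hypothetical throughout: `project_run_status`, the `AWAITING_RECONCILIATION` status, and the choice of which task statuses count as terminal are assumptions, not part of the reviewed code.

```python
# Task statuses after which no further work is pending (assumed set).
TERMINAL = frozenset({"CANCELLED", "SUCCEEDED", "SIMULATED_SUCCEEDED", "FAILED"})


def project_run_status(current_status: str,
                       cancellation_requested: bool,
                       task_statuses) -> str:
    """Derive the run's top-level status from its tasks."""
    statuses = list(task_statuses)
    # Unknown bank exposure outranks everything, including cancellation.
    if "SUBMITTED_UNKNOWN" in statuses:
        return "AWAITING_RECONCILIATION"  # hypothetical blocked state
    # Cancellation has taken effect once nothing is left in flight.
    if cancellation_requested and all(s in TERMINAL for s in statuses):
        return "CANCELLED"
    return current_status
```

Ranking unknown exposure above cancellation is a design choice in this sketch: a cancelled run that may still have money in flight should not report itself as cleanly `CANCELLED`.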

aligned

Ensure that background work remains perceptible

Background work is made perceptible through persistent run/task state (`RunStatus`, `TaskStatus`), JSONL-backed `RunLedger`, `DurableOperatorInbox`, `inspect_run()` summaries, critical alerts, `audit_chain_intact`, and explicit `SUBMITTED_UNKNOWN` / reconciliation counts. `_submit_one()` now records `task.submission_blocked`, `task.submitted_unconfirmed`, or `task.submitted_mock` rather than leaving the operator without a visible state. Delta: this improves the prior high-risk P2 finding by making the bank-handoff state visible and durable.

aligned

Apply progressive disclosure to system agency

The default inspection surface in `inspect_run()` gives a compact primary view—`status`, `bank_mode`, `critical_alerts`, `summary.by_status`, real versus simulated success counts, and `needs_reconciliation`—while the detailed task list and `RunLedger.read_run_events()` provide deeper inspection when needed. This separates summary state from diagnostic audit detail without forcing users to parse the full ledger for every routine check. Delta: this maintains the prior aligned result.

aligned

Represent delegated work as a system, not merely as a conversation

Delegated work is represented as a structured system rather than a conversation: `PaymentRun` and `PaymentTask` records are keyed by `run_id`/`task_id`, policy decisions and invoice snapshots are stored per task, state transitions are represented in enums, the ledger provides a timeline, and `inspect_run()` exposes task summaries plus detailed task objects. The workflow separates approval, execution, retry, reconciliation, and inspection rather than relying on a message stream. Delta: this maintains the prior aligned result.

Embed in your README

Two embeddable variants: a small flat shield and a richer score card.

Score card (recommended)

[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/5c1d5833-5f29-462f-b1a8-e774498c40fb/card.svg)](https://aidesignblueprint.com/en/readiness-review/5c1d5833-5f29-462f-b1a8-e774498c40fb)

Flat badge

[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/5c1d5833-5f29-462f-b1a8-e774498c40fb.svg)](https://aidesignblueprint.com/en/readiness-review/5c1d5833-5f29-462f-b1a8-e774498c40fb)

Baseline and iteration details

Baseline: used · Doctrine: same doctrine · Race: checked clear

Iteration delta

4 closed this pass · 2 reopened · 2 high-risk findings still open

Regressions (2)

P1 · Design for delegation rather than direct manipulation: aligned → high_risk

P10 · Optimise for steering, not only initiating: aligned → needs_changes

Improvements (4)

P2 · Ensure that background work remains perceptible: high_risk → aligned

P3 · Align feedback with the user’s level of attention: needs_changes → needs_changes

P5 · Replace implied magic with clear mental models: high_risk → needs_changes

P7 · Establish trust through inspectability: needs_changes → needs_changes

Rubric: 2026-05-04

Run ID: 5c1d5833-5f29-462f-b1a8-e774498c40fb · Results expire after 90 days