Assessment complete; awaiting evidence revision.
Evaluated 10 May 2026 against the AI Design Blueprint doctrine
Emerging
Status: High Risk
60/100
Grade C
Blueprint Readiness measures doctrine alignment, not runtime correctness. A production-ready verdict means the architecture embodies the 10 principles; the review does not run your tests or check your types. Layer it on top of your test suite, not in place of it.
This is an autonomous payment workflow with meaningful delegation, policy, status, inbox, replay, and audit primitives. However, the irreversible bank-transfer boundary is still unsafe: approvals are accepted from raw caller-supplied identity/role fields rather than a verifiable approval authority, and recovery/steering around submitted or failed payments can strand runs.
Iteration history
2 prior runs on this artifact. Each run_id opens its own readiness review.
Scores can move up or down between iterations: the validator's reasoning is not strictly deterministic, so the same artifact can score differently across runs. The per-principle deltas below show the substantive change.
Per-principle findings
10 principles evaluated. Verdict, severity, evidence and recommendation for each.
P8
high risk · production blocker · 90/100 · Make hand-offs, approvals, and blockers explicit
The code correctly separates drafting from execution: `cron_draft_daily_run()` only drafts, and `execute_approved_payments()` only submits `TaskStatus.APPROVED` tasks. However, the approval gate before an irreversible bank transfer is not trustworthy because `approve()` accepts `approver_user_id`, `approver_role`, and `approver_seen_policy_hash` as ordinary caller-provided strings. The comment says the caller must authenticate before calling, but the workflow itself does not verify a signed token or approval record. Any caller with access to this method can pass `approver_role='cfo'` and create a `task.approved` ledger event. Delta: this improves the prior P8 high-risk finding by removing in…
Recommendation
Put approval authority outside the execution object: require `approve()` to consume and verify a signed approval token or persisted approval record from an authenticated approval service, bound to the approver identity, role, `run_id`, `task_id`, `policy_hash`, and `invoice_snapshot_hash`, before setting `TaskStatus.APPROVED`.
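A minimal sketch of this recommendation, assuming an HMAC-signed token issued by a separate approval service after it has authenticated the approver. The names `APPROVAL_KEY`, `issue_approval_token`, and `verify_approval_token` are hypothetical and do not come from the reviewed code; a real deployment would more likely use asymmetric signatures and a managed key.

```python
import hashlib
import hmac
import json

# Hypothetical secret shared by the approval service and the workflow (demo only).
APPROVAL_KEY = b"demo-only-secret"

# Fields the review asks the approval to be bound to.
BOUND_FIELDS = ("approver_user_id", "approver_role", "run_id", "task_id",
                "policy_hash", "invoice_snapshot_hash")


def _canonical(claims: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte representation.
    return json.dumps(claims, sort_keys=True, separators=(",", ":")).encode()


def issue_approval_token(claims: dict, key: bytes = APPROVAL_KEY) -> dict:
    """Would run inside the (assumed) approval service, never in the workflow."""
    missing = [f for f in BOUND_FIELDS if f not in claims]
    if missing:
        raise ValueError(f"missing bound fields: {missing}")
    sig = hmac.new(key, _canonical(claims), hashlib.sha256).hexdigest()
    return {"claims": dict(claims), "sig": sig}


def verify_approval_token(token: dict, *, run_id: str, task_id: str,
                          policy_hash: str, invoice_snapshot_hash: str,
                          required_role: str, key: bytes = APPROVAL_KEY) -> bool:
    """Return True only if the signature is valid and every bound field matches."""
    claims = token.get("claims", {})
    expected = hmac.new(key, _canonical(claims), hashlib.sha256).hexdigest()
    presented = str(token.get("sig", ""))
    # Constant-time comparison on bytes avoids timing leaks and type errors.
    if not hmac.compare_digest(expected.encode(), presented.encode()):
        return False
    return (claims.get("run_id") == run_id
            and claims.get("task_id") == task_id
            and claims.get("policy_hash") == policy_hash
            and claims.get("invoice_snapshot_hash") == invoice_snapshot_hash
            and claims.get("approver_role") == required_role)
```

Under this shape, `approve()` would take the token instead of raw identity strings and would set `TaskStatus.APPROVED` only when verification returns `True`, so a caller that passes `approver_role='cfo'` without a valid signature is rejected.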
P10
high risk · production blocker · 84/100 · Optimise for steering, not only initiating
The code adds steering methods, but several are not yet reliable at the payment boundary. `retry_failed_task()` changes a failed task back to `APPROVED`, yet `execute_approved_payments()` refuses to run when the run is already `FAILED` or `PARTIALLY_COMPLETED`, which is the normal state after execution failures. Pause/cancel checks inside `execute_approved_payments()` read only the in-memory `run.pause_requested`/`run.cancellation_requested` flags, so a separate operator process appending pause/cancel intent to the ledger would not interrupt an already-running executor. There is also no steering path for `TaskStatus.SUBMITTED` after a crash around `transfer_funds()`. Delta: this improves the…
Recommendation
Move steering state and execution cursor into a transactional store that the executor re-reads before each external submit; allow resumed execution from `FAILED`, `PARTIALLY_COMPLETED`, and `SUBMITTED_UNKNOWN` states through explicit reconciliation/requeue transitions that preserve the original `idempotency_key`.
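One way to picture this recommendation is an executor loop that re-reads operator intent from a shared store before every external submit. This is a sketch under stated assumptions: `InMemorySteeringStore` stands in for a transactional store, and `execute_with_steering`, `EXECUTABLE_RUN_STATES`, and the plain-string statuses are hypothetical names, not the reviewed code's API.

```python
from dataclasses import dataclass, field


@dataclass
class InMemorySteeringStore:
    """Stand-in for a transactional store shared by operators and the executor."""
    intents: dict = field(default_factory=dict)

    def set_intent(self, run_id: str, intent: str) -> None:
        self.intents[run_id] = intent

    def get_intent(self, run_id: str) -> str:
        return self.intents.get(run_id, "run")


# Assumption: runs that already hit failures may be resumed, unlike the reviewed code.
EXECUTABLE_RUN_STATES = {"AWAITING_APPROVAL", "EXECUTING", "FAILED", "PARTIALLY_COMPLETED"}


def execute_with_steering(run_id, run_status, tasks, store, submit):
    """Submit APPROVED tasks one by one, honouring pause/cancel between submits."""
    if run_status not in EXECUTABLE_RUN_STATES:
        raise ValueError(f"run {run_id} is not executable from {run_status}")
    submitted = []
    for task in tasks:
        if task["status"] != "APPROVED":
            continue
        # Re-read operator intent from the store, not from in-memory run flags.
        intent = store.get_intent(run_id)
        if intent in ("pause", "cancel"):
            return submitted, intent
        task["status"] = "SUBMITTED"  # record the attempt before the side effect
        submit(task)
        submitted.append(task["task_id"])
    return submitted, "done"
```

Because the intent is read on every iteration, a pause written by a separate operator process takes effect before the next payment rather than after the whole batch.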
P2
needs changes · production blocker · 72/100 · Ensure that background work remains perceptible
The code adds perceptible state via `RunStatus`, `TaskStatus`, `inspect_run()`, JSONL `RunLedger`, and `replay_run_from_ledger()`, but it can still strand background work around the irreversible side-effect boundary. `_submit_one()` appends `task.submitted` before calling `transfer_funds()`; if the process dies during or after the bank call but before `task.succeeded`/`task.failed`, replay leaves the task in `TaskStatus.SUBMITTED`, `execute_approved_payments()` skips non-`APPROVED` tasks, and `retry_failed_task()` only handles `FAILED` tasks. The user can return to a durable but unresolved `submitted`/`executing` state with no reconciliation path. Delta: the current code improves the prior d…
Recommendation
Move execution state and the bank submission cursor into a transactional durable store, and model `SUBMITTED_UNKNOWN`/reconciliation as a first-class state that can be safely resumed using the existing `idempotency_key` before any further payments are attempted.
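The reconciliation state described here could look roughly like the sketch below. It assumes the bank deduplicates on the idempotency key and offers some way to look up a transfer by that key; `lookup_transfer`, `replay_status`, and `reconcile` are hypothetical names, and the `TaskStatus` enum is a simplified stand-in for the one in the reviewed code.

```python
from enum import Enum


class TaskStatus(Enum):
    APPROVED = "approved"
    SUBMITTED_UNKNOWN = "submitted_unknown"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# On replay, a submit with no recorded outcome is treated as unknown, not as done.
_EVENT_TO_STATUS = {
    "task.approved": TaskStatus.APPROVED,
    "task.submitted": TaskStatus.SUBMITTED_UNKNOWN,
    "task.succeeded": TaskStatus.SUCCEEDED,
    "task.failed": TaskStatus.FAILED,
}


def replay_status(event_types):
    """Derive a task's status from its ordered ledger event types."""
    status = None
    for event_type in event_types:
        status = _EVENT_TO_STATUS.get(event_type, status)
    return status


def reconcile(task, lookup_transfer, transfer_funds):
    """Resolve SUBMITTED_UNKNOWN by reusing the original idempotency key."""
    if task["status"] is not TaskStatus.SUBMITTED_UNKNOWN:
        return task["status"]
    key = task["idempotency_key"]
    transfer_id = lookup_transfer(key)  # assumed bank query; None if never received
    if transfer_id is None:
        # Safe to resend only because the bank is assumed to dedupe on the key.
        transfer_id = transfer_funds(task, key)
    task["transfer_id"] = transfer_id
    task["status"] = TaskStatus.SUCCEEDED
    return task["status"]
```

The point of the design is that a crash between `task.submitted` and the outcome event leaves the task in a named state with a defined exit, instead of a `SUBMITTED` state that neither the executor nor `retry_failed_task()` will touch.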
P5
needs changes · production blocker · 64/100 · Replace implied magic with clear mental models
Most policy fields are now load-bearing: `classify_invoice()` enforces vendor blocklist, due window, amount range, allowlist, and per-invoice cap; `draft_tasks()` rejects a mismatched `policy.hash()`; `execute_approved_payments()` enforces `max_run_total_pence`; and `approve()` checks `approver_seen_policy_hash`. The remaining mental-model failure is that `Policy.required_approver_role` appears to authorize payment approval, but `approve()` merely compares it with caller-supplied `approver_role`; there is no signed approval envelope or identity-service proof that `approver_user_id` actually has that role. Delta: this improves the prior high-risk policy-enforcement finding, but the approval-a…
Recommendation
Replace raw `approver_user_id`/`approver_role` parameters with a verified approval envelope issued by the identity/approval service and bound to `run_id`, `task_id`, `policy_hash`, and `invoice_snapshot_hash`; make that envelope the only source of approval authority.
P1
aligned · Design for delegation rather than direct manipulation
The workflow is structured around delegated work rather than manual step execution: `Policy` captures operator constraints, `create_run()` binds a `policy_hash`, `draft_tasks()` classifies invoices, and `execute_approved_payments()` submits only approved tasks. Operators have explicit controls through `approve()`, `pause_run()`, `resume_run()`, `cancel_run()`, and `retry_failed_task()`. Delta: this improves the prior P1 finding by making the run-owned policy hash load-bearing in drafting and audit events.
P3
aligned · Align feedback with the user’s level of attention
Feedback is proportionate and durable: `DurableOperatorInbox.post()` persists intervention alerts to JSONL, bank and policy failures create warning/critical alerts, and `inspect_run()` surfaces `critical_alerts` plus task-level `failure_class` and `failure_message`. Routine state is summarized through `summary.by_status`, while attention-required failures carry alert IDs. Delta: this addresses the prior P3 recommendation by replacing an in-memory inbox with a durable operator-facing inbox.
P4
aligned · Apply progressive disclosure to system agency
`inspect_run()` separates a primary operational summary (`status`, `policy_hash`, `audit_chain_intact`, `critical_alerts`, and `summary.by_status`) from per-task details such as invoice snapshot hashes, approver fields, transfer IDs, and failure messages. The code therefore supports a default overview with deeper inspection when needed. Delta: this maintains the prior aligned P4 result.
P6
aligned · Expose meaningful operational state, not internal complexity
The workflow exposes user-relevant states through `RunStatus` values such as `AWAITING_APPROVAL`, `PAUSED`, `EXECUTING`, `PARTIALLY_COMPLETED`, `CANCELLED`, and `FAILED`, plus `TaskStatus` values such as `APPROVED`, `SUBMITTED`, `SUCCEEDED`, `FAILED`, and `SKIPPED`. `inspect_run()` presents these states with counts and actionable task failure fields rather than only low-level log entries. Delta: this maintains the prior aligned P6 result.
P7
aligned · Establish trust through inspectability
The audit path includes a real inspectability primitive: every `AuditEvent` carries `sequence_no`, `prev_event_hash`, and `event_hash`; `_hash_event_body()` canonicalizes the event body; `RunLedger.append()` chains per-run events; and `verify_chain()` detects sequence, previous-hash, or body-hash divergence. Events include policy hashes, invoice snapshots, approvals, submissions, bank responses, failures, cancellations, and retries, and `inspect_run()` reports `audit_chain_intact` plus the first divergent event ID. Delta: this addresses the prior P7 recommendation by adding a hash-chained ledger and verification path.
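For readers unfamiliar with the pattern, the hash-chained ledger described in this finding can be illustrated as follows. This is an independent sketch, not the reviewed implementation: the field names `sequence_no`, `prev_event_hash`, and `event_hash` follow the finding, while `hash_event_body`, `append_event`, the `payload` field, and the empty-string genesis value are assumptions.

```python
import hashlib
import json


def hash_event_body(body: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal bodies hash equally.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def append_event(chain: list, payload: dict) -> dict:
    """Append an event whose hash covers its payload and the previous event's hash."""
    body = {
        "sequence_no": len(chain) + 1,
        "prev_event_hash": chain[-1]["event_hash"] if chain else "",
        "payload": payload,
    }
    event = dict(body, event_hash=hash_event_body(body))
    chain.append(event)
    return event


def verify_chain(chain: list):
    """Return (intact, sequence_no of first divergent event or None)."""
    prev_hash = ""
    for position, event in enumerate(chain, start=1):
        body = {k: v for k, v in event.items() if k != "event_hash"}
        if (event["sequence_no"] != position
                or event["prev_event_hash"] != prev_hash
                or event["event_hash"] != hash_event_body(body)):
            return False, position
        prev_hash = event["event_hash"]
    return True, None
```

Editing any earlier event changes its recomputed hash, so verification fails at that event, which is the property that lets `inspect_run()` report `audit_chain_intact` and point at the first divergent event.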
P9
aligned · Represent delegated work as a system, not merely as a conversation
Delegated work is represented as a structured system: `PaymentRun` owns the lifecycle, `PaymentTask` models per-invoice work, `RunLedger` records ordered events, `DurableOperatorInbox` records intervention alerts, and `inspect_run()` returns a task list plus aggregate status counts. Execution state is separate from conversational or prompt-like input. Delta: this maintains the prior aligned P9 result.
Iteration delta
Regressions (2)
Improvements (5)
Run ID: 15aa9649-84d8-4a75-aec7-fd101a5b0535 · Results expire after 90 days
Run by agents. Governed by humans. Validated by the AI Design Blueprint.