Reviewed

Assessment complete; awaiting evidence revision.

Agent Architecture Review, Validation snapshot

Evaluated 10 May 2026 against the AI Design Blueprint doctrine

Emerging

Status: High Risk

60/100

Grade C

6 aligned · 4 production blockers · 2 high risk

Blueprint Readiness measures doctrine alignment, not runtime correctness. A production-ready verdict means the architecture embodies the 10 principles; it does not run your tests or types. Layer it on top of your test suite, not in place of it.

Per-principle verdicts

This is an autonomous payment workflow with meaningful delegation, policy, status, inbox, replay, and audit primitives. However, the irreversible bank-transfer boundary is still unsafe: approvals are accepted from raw caller-supplied identity/role fields rather than a verifiable approval authority, and recovery/steering around submitted or failed payments can strand runs.

Iteration history

2 prior runs on this artifact. Each run_id opens its own readiness review.

Scores can move up or down between iterations: the validator's reasoning is not strictly deterministic, so the same artifact can score differently across runs. The per-principle deltas below show the substantive change.

When	Score	Tier	Run ID
10 May 2026 (this run)	60 / C	Emerging	15aa9649…
10 May 2026	30 / F	Draft	b8d61c00…
10 May 2026	0 / F	Draft	ac64d7d9…

Per-principle findings

10 principles evaluated. Verdict, severity, evidence and recommendation for each.

high risk · production blocker · 90/100

Make hand-offs, approvals, and blockers explicit

The code correctly separates drafting from execution: `cron_draft_daily_run()` only drafts, and `execute_approved_payments()` only submits `TaskStatus.APPROVED` tasks. However, the approval gate before an irreversible bank transfer is not trustworthy because `approve()` accepts `approver_user_id`, `approver_role`, and `approver_seen_policy_hash` as ordinary caller-provided strings. The comment says the caller must authenticate before calling, but the workflow itself does not verify a signed token or approval record. Any caller with access to this method can pass `approver_role='cfo'` and create a `task.approved` ledger event. Delta: this improves the prior P8 high-risk finding by removing in…

Recommendation

Put approval authority outside the execution object: require `approve()` to consume and verify a signed approval token or persisted approval record from an authenticated approval service, bound to the approver identity, role, `run_id`, `task_id`, `policy_hash`, and `invoice_snapshot_hash`, before setting `TaskStatus.APPROVED`.
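The binding described in this recommendation can be sketched as follows. This is a minimal illustration, not the reviewed code: `ApprovalClaims`, `sign_approval()`, and `verify_approval()` are hypothetical names, and a shared-secret HMAC stands in for whatever signature scheme the approval service would actually use. With a shared key the verifier can also mint tokens, so an asymmetric signature would likely be preferable in production.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ApprovalClaims:
    """Hypothetical approval envelope: who approved, and exactly what."""
    approver_user_id: str
    approver_role: str
    run_id: str
    task_id: str
    policy_hash: str
    invoice_snapshot_hash: str


def _canonical(claims: ApprovalClaims) -> bytes:
    # Stable serialisation so signer and verifier hash identical bytes.
    return json.dumps(asdict(claims), sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_approval(claims: ApprovalClaims, key: bytes) -> str:
    """Would run inside the (assumed) approval service, never in the workflow."""
    return hmac.new(key, _canonical(claims), hashlib.sha256).hexdigest()


def verify_approval(claims: ApprovalClaims, signature: str, key: bytes, *,
                    required_role: str, run_id: str, task_id: str,
                    policy_hash: str, invoice_snapshot_hash: str) -> bool:
    """Accept only if the signature is valid and every bound field matches."""
    if not hmac.compare_digest(sign_approval(claims, key), signature):
        return False
    return (
        claims.approver_role == required_role
        and claims.run_id == run_id
        and claims.task_id == task_id
        and claims.policy_hash == policy_hash
        and claims.invoice_snapshot_hash == invoice_snapshot_hash
    )
```

Under this sketch, `approve()` would set `TaskStatus.APPROVED` only after `verify_approval()` returns `True`, so a caller passing `approver_role='cfo'` without a valid signature is rejected.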

P10

high risk · production blocker · 84/100

Optimise for steering, not only initiating

The code adds steering methods, but several are not yet reliable at the payment boundary. `retry_failed_task()` changes a failed task back to `APPROVED`, yet `execute_approved_payments()` refuses to run when the run is already `FAILED` or `PARTIALLY_COMPLETED`, which is the normal state after execution failures. Pause/cancel checks inside `execute_approved_payments()` read only the in-memory `run.pause_requested`/`run.cancellation_requested` flags, so a separate operator process appending pause/cancel intent to the ledger would not interrupt an already-running executor. There is also no steering path for `TaskStatus.SUBMITTED` after a crash around `transfer_funds()`. Delta: this improves the…

Recommendation

Move steering state and execution cursor into a transactional store that the executor re-reads before each external submit; allow resumed execution from `FAILED`, `PARTIALLY_COMPLETED`, and `SUBMITTED_UNKNOWN` states through explicit reconciliation/requeue transitions that preserve the original `idempotency_key`.
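One way to picture this recommendation is an executor that re-reads steering flags from the store before every submit. The sketch below uses an in-memory SQLite database as a stand-in for the transactional store; the table layout, `open_store()`, `execute_approved()`, and the set of resumable run states are all assumptions for illustration, not the reviewed implementation.

```python
import sqlite3
from typing import Callable, List

# Assumption: run states from which execution may be resumed.
RESUMABLE_RUN_STATES = {"EXECUTING", "FAILED", "PARTIALLY_COMPLETED"}


def open_store() -> sqlite3.Connection:
    """Create a hypothetical schema for runs and tasks."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE runs (run_id TEXT PRIMARY KEY, status TEXT, "
        "pause_requested INTEGER, cancellation_requested INTEGER)"
    )
    conn.execute(
        "CREATE TABLE tasks (task_id TEXT PRIMARY KEY, run_id TEXT, "
        "status TEXT, idempotency_key TEXT)"
    )
    return conn


def execute_approved(conn: sqlite3.Connection, run_id: str,
                     submit: Callable[[str, str], bool]) -> List[str]:
    """Submit APPROVED tasks, checking stored steering intent before each one."""
    row = conn.execute("SELECT status FROM runs WHERE run_id = ?", (run_id,)).fetchone()
    if row is None or row[0] not in RESUMABLE_RUN_STATES:
        return []
    tasks = conn.execute(
        "SELECT task_id, idempotency_key FROM tasks "
        "WHERE run_id = ? AND status = 'APPROVED' ORDER BY task_id",
        (run_id,),
    ).fetchall()
    attempted: List[str] = []
    for task_id, idempotency_key in tasks:
        # Re-read flags from the store, not from an in-memory run object.
        pause, cancel = conn.execute(
            "SELECT pause_requested, cancellation_requested FROM runs WHERE run_id = ?",
            (run_id,),
        ).fetchone()
        if pause or cancel:
            break
        with conn:  # commit the SUBMITTED marker before the external call
            conn.execute("UPDATE tasks SET status = 'SUBMITTED' WHERE task_id = ?", (task_id,))
        ok = submit(task_id, idempotency_key)  # original key is reused on any retry
        with conn:
            conn.execute(
                "UPDATE tasks SET status = ? WHERE task_id = ?",
                ("SUCCEEDED" if ok else "FAILED", task_id),
            )
        attempted.append(task_id)
    return attempted
```

Because the flags are read per task, a pause or cancel written by a separate operator process takes effect before the next transfer rather than after the whole batch.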

needs changes · production blocker · 72/100

Ensure that background work remains perceptible

The code adds perceptible state via `RunStatus`, `TaskStatus`, `inspect_run()`, JSONL `RunLedger`, and `replay_run_from_ledger()`, but it can still strand background work around the irreversible side-effect boundary. `_submit_one()` appends `task.submitted` before calling `transfer_funds()`; if the process dies during or after the bank call but before `task.succeeded`/`task.failed`, replay leaves the task in `TaskStatus.SUBMITTED`, `execute_approved_payments()` skips non-`APPROVED` tasks, and `retry_failed_task()` only handles `FAILED` tasks. The user can return to a durable but unresolved `submitted`/`executing` state with no reconciliation path. Delta: the current code improves the prior d…

Recommendation

Move execution state and the bank submission cursor into a transactional durable store, and model `SUBMITTED_UNKNOWN`/reconciliation as a first-class state that can be safely resumed using the existing `idempotency_key` before any further payments are attempted.
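A reconciliation step of the kind recommended here might look like the sketch below. The `Task` class, the local `TaskStatus` enum, and the `lookup` callback (a stand-in for a bank status query keyed by `idempotency_key`, returning `"settled"`, `"rejected"`, `"not_found"`, or anything else for "still pending") are hypothetical; the real bank API and state names may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class TaskStatus(Enum):
    APPROVED = "approved"
    SUBMITTED = "submitted"
    SUBMITTED_UNKNOWN = "submitted_unknown"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Task:
    task_id: str
    idempotency_key: str
    status: TaskStatus


def mark_unresolved_after_replay(task: Task) -> None:
    """After a crash, a task left in SUBMITTED has an unknown bank outcome."""
    if task.status is TaskStatus.SUBMITTED:
        task.status = TaskStatus.SUBMITTED_UNKNOWN


def reconcile(task: Task, lookup: Callable[[str], str]) -> TaskStatus:
    """Resolve SUBMITTED_UNKNOWN by asking the bank about the original key."""
    if task.status is not TaskStatus.SUBMITTED_UNKNOWN:
        return task.status
    outcome = lookup(task.idempotency_key)
    if outcome == "settled":
        task.status = TaskStatus.SUCCEEDED
    elif outcome == "rejected":
        task.status = TaskStatus.FAILED
    elif outcome == "not_found":
        # The bank never saw it: requeue with the SAME idempotency_key.
        task.status = TaskStatus.APPROVED
    # Any other answer leaves the task unresolved for a later attempt.
    return task.status


def safe_to_continue(tasks: Iterable[Task]) -> bool:
    """Block further payments while any outcome is still unknown."""
    return all(t.status is not TaskStatus.SUBMITTED_UNKNOWN for t in tasks)
```

Keeping the `idempotency_key` unchanged across the requeue is what makes a repeated submit safe if the bank did in fact receive the first attempt.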

needs changes · production blocker · 64/100

Replace implied magic with clear mental models

Most policy fields are now load-bearing: `classify_invoice()` enforces vendor blocklist, due window, amount range, allowlist, and per-invoice cap; `draft_tasks()` rejects a mismatched `policy.hash()`; `execute_approved_payments()` enforces `max_run_total_pence`; and `approve()` checks `approver_seen_policy_hash`. The remaining mental-model failure is that `Policy.required_approver_role` appears to authorize payment approval, but `approve()` merely compares it with caller-supplied `approver_role`; there is no signed approval envelope or identity-service proof that `approver_user_id` actually has that role. Delta: this improves the prior high-risk policy-enforcement finding, but the approval-a…

Recommendation

Replace raw `approver_user_id`/`approver_role` parameters with a verified approval envelope issued by the identity/approval service and bound to `run_id`, `task_id`, `policy_hash`, and `invoice_snapshot_hash`; make that envelope the only source of approval authority.

aligned

Design for delegation rather than direct manipulation

The workflow is structured around delegated work rather than manual step execution: `Policy` captures operator constraints, `create_run()` binds a `policy_hash`, `draft_tasks()` classifies invoices, and `execute_approved_payments()` submits only approved tasks. Operators have explicit controls through `approve()`, `pause_run()`, `resume_run()`, `cancel_run()`, and `retry_failed_task()`. Delta: this improves the prior P1 finding by making the run-owned policy hash load-bearing in drafting and audit events.

aligned

Align feedback with the user’s level of attention

Feedback is proportionate and durable: `DurableOperatorInbox.post()` persists intervention alerts to JSONL, bank and policy failures create warning/critical alerts, and `inspect_run()` surfaces `critical_alerts` plus task-level `failure_class` and `failure_message`. Routine state is summarized through `summary.by_status`, while attention-required failures carry alert IDs. Delta: this addresses the prior P3 recommendation by replacing an in-memory inbox with a durable operator-facing inbox.

aligned

Apply progressive disclosure to system agency

`inspect_run()` separates a primary operational summary (`status`, `policy_hash`, `audit_chain_intact`, `critical_alerts`, and `summary.by_status`) from per-task details such as invoice snapshot hashes, approver fields, transfer IDs, and failure messages. The code therefore supports a default overview with deeper inspection when needed. Delta: this maintains the prior aligned P4 result.
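The overview-then-detail split described above can be illustrated with a small sketch. The function name `inspect_view()`, its `include_tasks` flag, and the dictionary shapes are assumptions for illustration; the actual return shape of `inspect_run()` is not shown in this review.

```python
from collections import Counter
from typing import Any, Dict, List


def inspect_view(status: str, policy_hash: str, tasks: List[Dict[str, Any]],
                 *, include_tasks: bool = False) -> Dict[str, Any]:
    """Return a compact overview by default; add per-task detail on request."""
    view: Dict[str, Any] = {
        "status": status,
        "policy_hash": policy_hash,
        "summary": {"by_status": dict(Counter(t["status"] for t in tasks))},
    }
    if include_tasks:
        # Deeper layer: copies of the task records, e.g. failure messages.
        view["tasks"] = [dict(t) for t in tasks]
    return view
```

An operator dashboard could call this with the default arguments and request `include_tasks=True` only when a count in `summary.by_status` needs explaining.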

aligned

Expose meaningful operational state, not internal complexity

The workflow exposes user-relevant states through `RunStatus` values such as `AWAITING_APPROVAL`, `PAUSED`, `EXECUTING`, `PARTIALLY_COMPLETED`, `CANCELLED`, and `FAILED`, plus `TaskStatus` values such as `APPROVED`, `SUBMITTED`, `SUCCEEDED`, `FAILED`, and `SKIPPED`. `inspect_run()` presents these states with counts and actionable task failure fields rather than only low-level log entries. Delta: this maintains the prior aligned P6 result.

aligned

Establish trust through inspectability

The audit path includes a real inspectability primitive: every `AuditEvent` carries `sequence_no`, `prev_event_hash`, and `event_hash`; `_hash_event_body()` canonicalizes the event body; `RunLedger.append()` chains per-run events; and `verify_chain()` detects sequence, previous-hash, or body-hash divergence. Events include policy hashes, invoice snapshots, approvals, submissions, bank responses, failures, cancellations, and retries, and `inspect_run()` reports `audit_chain_intact` plus the first divergent event ID. Delta: this addresses the prior P7 recommendation by adding a hash-chained ledger and verification path.
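For readers unfamiliar with hash-chained ledgers, the mechanism can be sketched in a few lines. The field names `sequence_no`, `prev_event_hash`, and `event_hash` come from the finding above; everything else (the `GENESIS` value, the dictionary-based events, `append_event()`, and returning the index of the first divergent event) is an assumption and may differ from the reviewed `RunLedger`.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional

GENESIS = "0" * 64  # assumed previous-hash value for the first event


def hash_event_body(body: Dict[str, Any]) -> str:
    """Hash a canonical JSON form so key order cannot change the digest."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_event(chain: List[Dict[str, Any]], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Append an event whose hash covers its payload and its predecessor's hash."""
    body = {
        "sequence_no": len(chain),
        "prev_event_hash": chain[-1]["event_hash"] if chain else GENESIS,
        "payload": payload,
    }
    event = dict(body, event_hash=hash_event_body(body))
    chain.append(event)
    return event


def verify_chain(chain: List[Dict[str, Any]]) -> Optional[int]:
    """Return the index of the first divergent event, or None if intact."""
    prev = GENESIS
    for index, event in enumerate(chain):
        body = {k: v for k, v in event.items() if k != "event_hash"}
        if (
            event["sequence_no"] != index
            or event["prev_event_hash"] != prev
            or hash_event_body(body) != event["event_hash"]
        ):
            return index
        prev = event["event_hash"]
    return None
```

Editing any past payload changes that event's body hash, and rewriting its `event_hash` to match would break the next event's `prev_event_hash`, which is why a single pass over the chain detects tampering.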

aligned

Represent delegated work as a system, not merely as a conversation

Delegated work is represented as a structured system: `PaymentRun` owns the lifecycle, `PaymentTask` models per-invoice work, `RunLedger` records ordered events, `DurableOperatorInbox` records intervention alerts, and `inspect_run()` returns a task list plus aggregate status counts. Execution state is separate from conversational or prompt-like input. Delta: this maintains the prior aligned P9 result.

Embed in your README

Two embeddable variants: a small flat shield and a richer score card.

Score card (recommended)

[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/15aa9649-84d8-4a75-aec7-fd101a5b0535/card.svg)](https://aidesignblueprint.com/en/readiness-review/15aa9649-84d8-4a75-aec7-fd101a5b0535)

Flat badge

[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/15aa9649-84d8-4a75-aec7-fd101a5b0535.svg)](https://aidesignblueprint.com/en/readiness-review/15aa9649-84d8-4a75-aec7-fd101a5b0535)

Baseline and iteration details

Baseline: used · Doctrine: same doctrine · Race: checked clear

Iteration delta

5 closed this pass · 2 reopened · 2 high-risk findings still open

Regressions (2)

P2 · Ensure that background work remains perceptible · needs_changes → needs_changes

P10 · Optimise for steering, not only initiating · needs_changes → high_risk

Improvements (5)

P1 · Design for delegation rather than direct manipulation · needs_changes → aligned

P3 · Align feedback with the user’s level of attention · needs_changes → aligned

P5 · Replace implied magic with clear mental models · high_risk → needs_changes

P7 · Establish trust through inspectability · needs_changes → aligned

P8 · Make hand-offs, approvals, and blockers explicit · high_risk → high_risk

Rubric: 2026-05-04

Run ID: 15aa9649-84d8-4a75-aec7-fd101a5b0535 · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.