Reviewed

Assessment complete; awaiting evidence revision.

Agent Architecture Review, Validation snapshot

Evaluated 10 May 2026 against the AI Design Blueprint doctrine

Status: High Risk

30/100

Grade F

3 aligned · 7 production blockers · 2 high risk

Blueprint Readiness measures doctrine alignment, not runtime correctness. A production-ready verdict means the architecture embodies the 10 principles; it does not run your tests or check your types. Layer it on top of your test suite, not in place of it.

Per-principle verdicts

The submission is a substantial improvement over the prior single-shot payment loop: it introduces runs, tasks, explicit policy objects, task states, approval records, cancellation checks, and an audit ledger. However, the production trust boundary still fails for irreversible payments because approval authority is self-declared, the cron path auto-approves, key policy constraints are not enforced, state/alerts remain process-local, and the ledger is not tamper-evident.

Iteration history

1 prior run on this artifact. Each run_id opens its own readiness review.

Scores can move up or down between iterations: the validator's reasoning is not strictly deterministic, so the same artifact can score differently across runs. The per-principle deltas below show the substantive change.

When	Score	Tier	Run ID
10 May 2026 (this run)	30 / F	Draft	b8d61c00…
10 May 2026	0 / F	Draft	ac64d7d9…

Per-principle findings

10 principles evaluated. Verdict, severity, evidence and recommendation for each.

high risk · production blocker · 94/100

Make hand-offs, approvals, and blockers explicit

There is a visible approval state: `_submit_one()` only runs for `TaskStatus.APPROVED`, and `approve()` records `approver_user_id`, `approved_policy_hash`, and `approved_at`. However, the approval boundary is not trustworthy for irreversible bank transfers. `approve()` accepts any caller-supplied `approver_user_id` and only checks a self-reported `approver_seen_policy_hash`; it never authenticates the actor or validates `Policy.required_approver_role`. Worse, `run_daily_batch()` auto-approves every `TaskStatus.AWAITING_APPROVAL` task in-process using the default `approver_user_id='cfo@example.com'`, so the cron path can call `transfer_funds()` without an actual human approval. Delta: the pri…

Recommendation

Separate approval from execution. Remove in-process auto-approval from `run_daily_batch()`, require an authenticated approval service to issue a signed approval token bound to `run_id`, `task_id`, invoice snapshot hash, amount, and `policy_hash`, and have the executor accept only those signed approvals from principals with the required role.
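A signed approval bound to the fields named above could be sketched as follows. This is a minimal illustration, not the submission's code: `ApprovalClaim`, `sign_approval`, and `verify_approval` are hypothetical names, and the shared-secret HMAC stands in for whatever signing scheme the approval service actually uses. With a shared key the executor could also mint tokens, so a production design would more likely use an asymmetric signature held only by the approval service.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ApprovalClaim:
    """Hypothetical claim: everything the approver actually agreed to."""
    run_id: str
    task_id: str
    invoice_snapshot_hash: str
    amount_pence: int
    policy_hash: str
    approver_user_id: str
    approver_role: str


def _canonical(claim: ApprovalClaim) -> bytes:
    # Sorted keys and fixed separators give a stable byte encoding to sign.
    return json.dumps(asdict(claim), sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_approval(claim: ApprovalClaim, key: bytes) -> str:
    """Issued by the approval service after it has authenticated the approver."""
    return hmac.new(key, _canonical(claim), hashlib.sha256).hexdigest()


def verify_approval(claim: ApprovalClaim, token: str, key: bytes, required_role: str) -> bool:
    """Run by the executor immediately before any transfer is attempted."""
    expected = sign_approval(claim, key)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, token) and claim.approver_role == required_role
```

Because the amount, invoice snapshot hash, and policy hash are all part of the signed payload, changing any of them after approval invalidates the token, which is the property a self-declared `approver_user_id` cannot provide.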

high risk · production blocker · 86/100

Replace implied magic with clear mental models

The `Policy` dataclass gives the user a clear intended mental model, but the implementation violates that model in load-bearing ways. `max_run_total_pence` is documented as a run-level cap but is never checked; `required_approver_role` is stored but never enforced; `draft_tasks()` can classify invoices using a different `policy` from the one captured in `run.policy_hash`; and audit payloads then record `run.policy_hash`, which can make the decision appear authorised by a policy that was not actually used for classification. Delta: the explicit policy object improves the prior implied magic, but unenforced policy fields create a more dangerous false sense of control.

Recommendation

Convert `Policy` from descriptive configuration into enforced authority. Store the exact policy object on the run, classify only against that stored policy, enforce every declared constraint in code, and fail closed if any policy field is not enforceable. Remove or rename fields that are not actually honoured.
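One way to make "fail closed if any policy field is not enforceable" concrete is to require an enforcement function for every declared field. The sketch below is an assumption-laden illustration: `max_run_total_pence` and `required_approver_role` come from the review, while `max_invoice_pence`, `ENFORCERS`, `enforce`, and `PolicyViolation` are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Policy:
    max_invoice_pence: int       # hypothetical per-invoice limit
    max_run_total_pence: int     # field named in the review
    required_approver_role: str  # field named in the review


class PolicyViolation(Exception):
    """Raised when a constraint fails or has no enforcement code at all."""


Check = Callable[[Policy, Sequence[int], str], bool]

# One check per declared field; a field missing from this table is an error.
ENFORCERS: Dict[str, Check] = {
    "max_invoice_pence": lambda p, amounts, role: all(a <= p.max_invoice_pence for a in amounts),
    "max_run_total_pence": lambda p, amounts, role: sum(amounts) <= p.max_run_total_pence,
    "required_approver_role": lambda p, amounts, role: role == p.required_approver_role,
}


def enforce(policy: Policy, amounts_pence: Sequence[int], approver_role: str) -> None:
    """Raise PolicyViolation unless every declared field is enforced and satisfied."""
    for field in fields(policy):
        check = ENFORCERS.get(field.name)
        if check is None:
            # Fail closed: a documented but unenforced field blocks the run.
            raise PolicyViolation(f"policy field {field.name!r} has no enforcer")
        if not check(policy, amounts_pence, approver_role):
            raise PolicyViolation(f"policy field {field.name!r} violated")
```

With this shape, adding a field to `Policy` without adding its check stops execution instead of silently creating another advisory setting.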

needs changes · production blocker · 70/100

Establish trust through inspectability

`RunLedger.append()` writes structured `AuditEvent` rows for run creation, approvals, submissions, successes, failures, cancellations, invoice snapshots, policy decisions, idempotency keys, and bank responses. That is materially more inspectable than the prior loop. But the ledger is only JSONL append-by-convention: there is no hash chain, HMAC/signature, immutable storage, sequence number, or integrity check, despite comments describing it as tamper-evident. It can also log a misleading `policy_hash` if `draft_tasks()` receives a different policy from the run policy. Delta: audit coverage improved, but audit integrity and policy-decision provenance remain insufficient for payment accountabi…

Recommendation

Move the audit ledger outside the execution loop into an append-only durable store with tamper evidence, such as a hash-chained event table or signed ledger. Include previous-event hash, sequence number, actor, exact policy hash used for classification, invoice snapshot hash, approval witness, transfer request hash, and response hash.
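A hash-chained event list, one of the options mentioned above, might look like the sketch below. It is a minimal in-memory illustration under stated assumptions: `append_event`, `verify_chain`, and the event dictionary shape are hypothetical and are not the submission's `RunLedger` API.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first event


def _digest(seq: int, prev_hash: str, payload: Dict) -> str:
    # The hash covers the sequence number, the previous hash, and the payload.
    body = json.dumps(
        {"seq": seq, "prev_hash": prev_hash, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_event(chain: List[Dict], payload: Dict) -> Dict:
    """Append an event whose hash commits to everything before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    seq = len(chain)
    event = {
        "seq": seq,
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": _digest(seq, prev_hash, payload),
    }
    chain.append(event)
    return event


def verify_chain(chain: List[Dict]) -> bool:
    """Return False if any event was edited, reordered, or removed mid-chain."""
    prev_hash = GENESIS
    for index, event in enumerate(chain):
        if event["seq"] != index or event["prev_hash"] != prev_hash:
            return False
        if event["hash"] != _digest(event["seq"], event["prev_hash"], event["payload"]):
            return False
        prev_hash = event["hash"]
    return True
```

A chain like this detects edits and mid-chain deletions but not truncation of the most recent events, so the latest hash would still need to be anchored somewhere the executor cannot rewrite.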

needs changes · production blocker · 68/100

Ensure that background work remains perceptible

The workflow has visible states via `RunStatus`, `TaskStatus`, and JSONL `AuditEvent`s, and `_submit_one()` records `task.submitted` before calling the bank. However, operational continuity is still process-local: `PaymentWorkflow.runs` and `PaymentWorkflow.tasks` are in-memory dictionaries, `inspect_run()` only reads those projections, and there is no replay path from `RunLedger.read_run_events()` after restart. A crash after `task.submitted` but before `task.succeeded`/`task.failed` would leave the persisted ledger inspectable only manually, not as a resumable workflow. Delta: this is a major improvement over the prior opaque loop, but persistence is not yet a usable lifecycle primitive.

Recommendation

Move run/task projections and execution cursor into a durable store or implement deterministic ledger replay before `inspect_run()` and `execute_approved()`. Treat `submitted_without_response` as an explicit recoverable state and persist cancellation flags outside the process.
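Deterministic replay can be illustrated with a small fold over ledger events. In this sketch the event names `task.submitted`, `task.succeeded`, and `task.failed` come from the review; `task.approved`, `task.cancelled`, the event dictionary shape, and the functions `replay` and `needs_reconciliation` are assumptions made for the example.

```python
from typing import Dict, Iterable, List

# Events that close a task, mapped to the state they produce.
TERMINAL = {
    "task.succeeded": "succeeded",
    "task.failed": "failed",
    "task.cancelled": "cancelled",
}


def replay(events: Iterable[Dict]) -> Dict[str, str]:
    """Rebuild task states from ledger events, in the order they were written."""
    states: Dict[str, str] = {}
    for event in events:
        task_id, kind = event["task_id"], event["kind"]
        if kind == "task.approved":
            states[task_id] = "approved"
        elif kind == "task.submitted":
            # Until a terminal event arrives, the bank outcome is unknown.
            states[task_id] = "submitted_without_response"
        elif kind in TERMINAL:
            states[task_id] = TERMINAL[kind]
    return states


def needs_reconciliation(states: Dict[str, str]) -> List[str]:
    """Tasks that were sent to the bank but never recorded an outcome."""
    return sorted(t for t, s in states.items() if s == "submitted_without_response")
```

Running `replay` on startup, before `inspect_run()` or `execute_approved()`, would turn the crash window described above into an explicit state an operator can act on, and stop a restart from quietly discarding it.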

needs changes · production blocker · 62/100

Design for delegation rather than direct manipulation

The code now models delegated work with `PaymentRun`, `PaymentTask`, and a `Policy` containing amount limits, vendor lists, due-window constraints, and approver-role intent. But the delegation contract is not yet authoritative: `draft_tasks()` accepts an arbitrary `policy` argument without checking `policy.hash() == run.policy_hash`, `Policy.max_run_total_pence` is never enforced, and `Policy.required_approver_role` is never checked. Delta: this improves the prior run/task/policy gap, but the current policy object is still partly advisory rather than a hard boundary.

Recommendation

Make the run-owned policy the single source of authority: persist the policy or signed policy envelope with the run, reject drafting/execution if the supplied policy hash differs from `run.policy_hash`, and enforce run-total, invoice, vendor, due-window, and approver-role constraints before any task can become executable.
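The hash check described above could be sketched like this. `Policy.hash()`, `PaymentRun`, `run.policy_hash`, and `draft_tasks()` are names taken from the review, but their bodies here are assumptions, as are `create_run`, `PolicyMismatch`, `max_invoice_pence`, and the invoice dictionary shape.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Policy:
    max_invoice_pence: int  # hypothetical per-invoice limit
    max_run_total_pence: int
    required_approver_role: str

    def hash(self) -> str:
        body = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class PaymentRun:
    run_id: str
    policy: Policy    # the exact policy object is stored on the run
    policy_hash: str


class PolicyMismatch(Exception):
    """The supplied policy is not the one the run was created with."""


def create_run(run_id: str, policy: Policy) -> PaymentRun:
    return PaymentRun(run_id=run_id, policy=policy, policy_hash=policy.hash())


def draft_tasks(run: PaymentRun, invoices: List[Dict], policy: Policy) -> List[Dict]:
    # Reject before classifying anything if the caller's policy differs.
    if policy.hash() != run.policy_hash:
        raise PolicyMismatch(f"policy does not match run {run.run_id}")
    tasks = []
    for invoice in invoices:
        # Classify against the run-owned policy only, never the argument.
        within_limit = invoice["amount_pence"] <= run.policy.max_invoice_pence
        tasks.append({
            "invoice_id": invoice["invoice_id"],
            "decision": "awaiting_approval" if within_limit else "skipped",
            "policy_hash": run.policy_hash,
        })
    return tasks
```

Since classification reads only `run.policy`, the `policy_hash` written to each task always names the policy that produced the decision.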

P10

needs changes · production blocker · 58/100

Optimise for steering, not only initiating

The workflow has an initial steering primitive: `cancel_run()` sets `cancellation_requested`, marks not-yet-submitted tasks `CANCELLED`, and `execute_approved()` checks the flag before each call to `transfer_funds()`. Idempotency keys also reduce duplicate-payment risk. But steering is incomplete and partly non-durable: the cancellation flag is only in memory, there is no pause/resume or reprioritisation path, failed tasks cannot actually be retried even though the inbox message says to retry from the operator inbox, and `execute_approved()` ignores `TaskStatus.FAILED` rather than offering a controlled replay with the same idempotency key. Delta: this improves the prior no-interrupt design,…

Recommendation

Persist steering state and execution cursor outside the process, then add explicit lifecycle operations for pause, resume, cancel, and retry/requeue of failed or submitted-unknown tasks using the original idempotency key. Keep the worker separate from the approval and steering surfaces so operators can intervene safely between external actions.
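A controlled retry that reuses the original idempotency key, together with a pause check between external actions, might be sketched as below. This is illustrative only: the `is_paused` callback, `requeue`, the `attempts` counter, and the in-memory task objects are assumptions, and a real implementation would persist all of this state outside the process as recommended above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class BankAPIError(Exception):
    """Stand-in for the bank client's error type."""


class TaskStatus(str, Enum):
    APPROVED = "approved"
    SUBMITTED = "submitted"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class PaymentTask:
    task_id: str
    idempotency_key: str  # fixed at creation and never regenerated
    status: TaskStatus = TaskStatus.APPROVED
    attempts: int = 0


def execute_approved(
    tasks: Iterable[PaymentTask],
    transfer: Callable[[str], None],
    is_paused: Callable[[], bool],
) -> None:
    for task in tasks:
        if is_paused():
            return  # checked before every external action
        if task.status is not TaskStatus.APPROVED:
            continue
        task.status = TaskStatus.SUBMITTED
        task.attempts += 1
        try:
            transfer(task.idempotency_key)
        except BankAPIError:
            task.status = TaskStatus.FAILED
        else:
            task.status = TaskStatus.SUCCEEDED


def requeue(task: PaymentTask, max_attempts: int = 3) -> None:
    """Operator action: make a failed or unknown-outcome task executable again."""
    if task.status not in (TaskStatus.FAILED, TaskStatus.SUBMITTED):
        raise ValueError(f"task {task.task_id} is not retryable from {task.status.value}")
    if task.attempts >= max_attempts:
        raise ValueError(f"task {task.task_id} has exhausted its retries")
    task.status = TaskStatus.APPROVED  # idempotency_key is deliberately unchanged
```

Keeping the key stable across attempts is what lets the bank deduplicate a retry of a transfer that may already have gone through.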

needs changes · production blocker · 50/100

Align feedback with the user’s level of attention

The code separates routine progress from intervention paths by recording normal transitions in `_record()` and escalating `BankAPIError` plus policy-hash mismatch through `OperatorInbox.post()`. But `OperatorInbox` is only an in-memory `list`, and `run_daily_batch()` creates it locally and discards it, so a critical `auth_missing` alert may never reach an operator in a cron/background context. Delta: this addresses the prior swallowed-exception pattern, but the escalation channel is still not durable or operator-facing.

Recommendation

Move `OperatorInbox` behind a durable operator-facing notification surface, such as a database-backed inbox, dashboard, or incident channel. Return or persist alert IDs from `run_daily_batch()` so critical failures cannot disappear with the process.
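A database-backed inbox that returns alert IDs could be sketched with the standard-library `sqlite3` module. The class name `DurableOperatorInbox`, the table schema, and the method names are assumptions for illustration; SQLite is used only to keep the example self-contained and is not a recommendation over a shared database or incident channel.

```python
import datetime
import os
import sqlite3
import tempfile
import uuid
from typing import List, Tuple


class DurableOperatorInbox:
    """Hypothetical inbox whose alerts survive the process that posted them."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        with self._conn:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS alerts ("
                "alert_id TEXT PRIMARY KEY, run_id TEXT NOT NULL, "
                "severity TEXT NOT NULL, message TEXT NOT NULL, "
                "created_at TEXT NOT NULL, acknowledged INTEGER NOT NULL DEFAULT 0)"
            )

    def post(self, run_id: str, severity: str, message: str) -> str:
        alert_id = uuid.uuid4().hex
        created_at = datetime.datetime.now(datetime.timezone.utc).isoformat()
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT INTO alerts (alert_id, run_id, severity, message, created_at) "
                "VALUES (?, ?, ?, ?, ?)",
                (alert_id, run_id, severity, message, created_at),
            )
        return alert_id  # callers can return or persist this ID

    def unacknowledged(self, run_id: str) -> List[Tuple[str, str, str]]:
        cursor = self._conn.execute(
            "SELECT alert_id, severity, message FROM alerts "
            "WHERE run_id = ? AND acknowledged = 0 ORDER BY created_at",
            (run_id,),
        )
        return cursor.fetchall()

    def close(self) -> None:
        self._conn.close()


# Usage: the alert is still readable after the posting connection is closed.
db_path = os.path.join(tempfile.mkdtemp(), "inbox.db")
inbox = DurableOperatorInbox(db_path)
alert_id = inbox.post("run-1", "critical", "auth_missing")
inbox.close()

reopened = DurableOperatorInbox(db_path)
pending = reopened.unacknowledged("run-1")
reopened.close()
```

In a cron context, `run_daily_batch()` would return the collected alert IDs so that the scheduler's own logs point at alerts that are still pending.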

aligned

Apply progressive disclosure to system agency

`inspect_run()` provides a concise top-level run view with `run_id`, `status`, `policy_hash`, cancellation fields, and a `summary` grouped by task status, then exposes per-task drill-down including `policy_decision`, invoice snapshot, approval identity/time, transfer ID, and failure details. Delta: this improves the prior lack of progressive inspection with a clear summary-plus-detail structure.

aligned

Expose meaningful operational state, not internal complexity

The code exposes user-relevant lifecycle states through `RunStatus` values such as `awaiting_approval`, `executing`, `partially_completed`, and `failed`, plus `TaskStatus` values such as `awaiting_approval`, `approved`, `submitted`, `succeeded`, `failed`, `skipped`, and `cancelled`. Technical details are kept in diagnostic fields like `failure_class`, `failure_message`, and `bank_response` rather than replacing the operational state. Delta: this addresses the prior single-shot loop by making payment progress understandable at the run and task levels.

aligned

Represent delegated work as a system, not merely as a conversation

The workflow is represented as a structured system rather than a conversation: `PaymentRun` owns child `PaymentTask`s, `RunStatus` and `TaskStatus` encode lifecycle, `_record()` emits timeline events, `execute_approved()` derives final run outcome from task states, and `inspect_run()` shows the run/task hierarchy. Delta: this resolves the prior lack of a system representation, although separate durability concerns are covered under P2 and P7.

Embed in your README

Two embeddable variants: a small flat shield and a richer score card.

Score card (recommended)

[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/b8d61c00-2b86-45f0-a533-526202371592/card.svg)](https://aidesignblueprint.com/en/readiness-review/b8d61c00-2b86-45f0-a533-526202371592)

Flat badge

[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/b8d61c00-2b86-45f0-a533-526202371592.svg)](https://aidesignblueprint.com/en/readiness-review/b8d61c00-2b86-45f0-a533-526202371592)

Baseline and iteration details

Baseline: used · Doctrine: same doctrine · Race: checked clear

Iteration delta

9 closed this pass · 0 reopened · 2 high-risk findings still open

Improvements (9)

Principle	Name	Prior verdict	Current verdict
P1	Design for delegation rather than direct manipulation	needs_changes	needs_changes
P2	Ensure that background work remains perceptible	high_risk	needs_changes
P3	Align feedback with the user’s level of attention	high_risk	needs_changes
P4	Apply progressive disclosure to system agency	needs_changes	aligned
P6	Expose meaningful operational state, not internal complexity	high_risk	aligned
P7	Establish trust through inspectability	high_risk	needs_changes
P8	Make hand-offs, approvals, and blockers explicit	high_risk	high_risk
P9	Represent delegated work as a system, not merely as a conversation	needs_changes	aligned
P10	Optimise for steering, not only initiating	high_risk	needs_changes

Rubric: 2026-05-04

Run ID: b8d61c00-2b86-45f0-a533-526202371592 · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.