Reviewed

Assessment complete; awaiting evidence revision.

Agent Architecture Review, Validation snapshot

Evaluated 10 May 2026 against the AI Design Blueprint doctrine

Status: High Risk

30/100

Grade F

3 aligned · 7 production blockers · 2 high risk

Per-principle verdicts

The submission is a substantial improvement over the prior single-shot payment loop: it introduces runs, tasks, explicit policy objects, task states, approval records, cancellation checks, and an audit ledger. However, the production trust boundary still fails for irreversible payments because approval authority is self-declared, the cron path auto-approves, key policy constraints are not enforced, state/alerts remain process-local, and the ledger is not tamper-evident.

Iteration history

1 prior run on this artifact. Each run_id opens its own readiness review.

| When | Score | Status | Run ID |
| --- | --- | --- | --- |
| 10 May 2026 (this run) | 30 / F | High Risk | b8d61c00 |
| 10 May 2026 | 0 / F | High Risk | ac64d7d9 |

Per-principle findings

10 principles evaluated. Verdict, severity, evidence and recommendation for each.

P0

Make hand-offs, approvals, and blockers explicit

high risk · production blocker · 94/100

There is a visible approval state: `_submit_one()` only runs for `TaskStatus.APPROVED`, and `approve()` records `approver_user_id`, `approved_policy_hash`, and `approved_at`. However, the approval boundary is not trustworthy for irreversible bank transfers. `approve()` accepts any caller-supplied `approver_user_id` and only checks a self-reported `approver_seen_policy_hash`; it never authenticates the actor or validates `Policy.required_approver_role`. Worse, `run_daily_batch()` auto-approves every `TaskStatus.AWAITING_APPROVAL` task in-process using the default `approver_user_id='cfo@example.com'`, so the cron path can call `transfer_funds()` without an actual human approval. Delta: the pri…

Recommendation

Separate approval from execution. Remove in-process auto-approval from `run_daily_batch()`, require an authenticated approval service to issue a signed approval token bound to `run_id`, `task_id`, invoice snapshot hash, amount, and `policy_hash`, and have the executor accept only those signed approvals from principals with the required role.
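
One way to picture the signed approval token is the minimal sketch below. It is not taken from the submission: `ApprovalClaim`, `sign_approval`, `verify_approval`, and the `task_binding` dictionary are hypothetical names, and HMAC is used only to keep the example within the standard library. Because HMAC is symmetric, an executor that holds the key could forge approvals, so a production design would more likely use an asymmetric signature issued by the approval service.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ApprovalClaim:
    """Hypothetical claim set: everything the approver actually saw."""
    run_id: str
    task_id: str
    invoice_snapshot_hash: str
    amount_pence: int
    policy_hash: str
    approver_user_id: str
    approver_role: str


def _canonical(claim: ApprovalClaim) -> bytes:
    # Stable serialisation so signer and verifier hash identical bytes.
    return json.dumps(asdict(claim), sort_keys=True, separators=(",", ":")).encode()


def sign_approval(claim: ApprovalClaim, key: bytes) -> str:
    """Assumed to run only inside an authenticated approval service."""
    return hmac.new(key, _canonical(claim), hashlib.sha256).hexdigest()


def verify_approval(claim: ApprovalClaim, signature: str, key: bytes, *,
                    required_role: str, task_binding: dict) -> bool:
    """Executor-side check, run immediately before the transfer call."""
    if not hmac.compare_digest(sign_approval(claim, key), signature):
        return False  # forged or altered token
    if claim.approver_role != required_role:
        return False  # approver lacks the role the policy demands
    # Every bound field must match what the executor is about to pay.
    return all(getattr(claim, name) == value for name, value in task_binding.items())
```

With this shape the executor never decides who approved; it only checks that a token exists for exactly the run, task, invoice snapshot, amount, and policy it is about to act on.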

P0

Replace implied magic with clear mental models

high risk · production blocker · 86/100

The `Policy` dataclass gives the user a clear intended mental model, but the implementation violates that model in load-bearing ways. `max_run_total_pence` is documented as a run-level cap but is never checked; `required_approver_role` is stored but never enforced; `draft_tasks()` can classify invoices using a different `policy` from the one captured in `run.policy_hash`; and audit payloads then record `run.policy_hash`, which can make the decision appear authorised by a policy that was not actually used for classification. Delta: the explicit policy object improves the prior implied magic, but unenforced policy fields create a more dangerous false sense of control.

Recommendation

Convert `Policy` from descriptive configuration into enforced authority. Store the exact policy object on the run, classify only against that stored policy, enforce every declared constraint in code, and fail closed if any policy field is not enforceable. Remove or rename fields that are not actually honoured.
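
A fail-closed check for the run-level cap might look like the sketch below. Only `max_run_total_pence` is named in the reviewed code; the other `Policy` fields, the `PolicyViolation` exception, and `check_run` are hypothetical stand-ins chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


class PolicyViolation(Exception):
    """Raised when any declared constraint cannot be satisfied."""


@dataclass(frozen=True)
class Policy:
    max_invoice_pence: int          # hypothetical field name
    max_run_total_pence: int        # named in the reviewed code
    allowed_vendors: frozenset      # hypothetical field name


def check_run(policy: Policy, invoices: list[tuple[str, int]]) -> int:
    """Return the run total in pence, or raise on the first violation."""
    total = 0
    for vendor, amount_pence in invoices:
        if vendor not in policy.allowed_vendors:
            raise PolicyViolation(f"vendor not allowed: {vendor}")
        if amount_pence <= 0 or amount_pence > policy.max_invoice_pence:
            raise PolicyViolation(f"invoice amount out of bounds: {amount_pence}")
        total += amount_pence
        if total > policy.max_run_total_pence:
            # Fail closed: reject the whole run instead of paying a prefix.
            raise PolicyViolation(f"run total {total} exceeds cap")
    return total
```

Raising on the first violation, instead of skipping the offending invoice, keeps the cap a hard boundary: a run either fits the policy in full or produces no executable tasks.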

P0

Establish trust through inspectability

needs changes · production blocker · 70/100

`RunLedger.append()` writes structured `AuditEvent` rows for run creation, approvals, submissions, successes, failures, cancellations, invoice snapshots, policy decisions, idempotency keys, and bank responses. That is materially more inspectable than the prior loop. But the ledger is only JSONL append-by-convention: there is no hash chain, HMAC/signature, immutable storage, sequence number, or integrity check, despite comments describing it as tamper-evident. It can also log a misleading `policy_hash` if `draft_tasks()` receives a different policy from the run policy. Delta: audit coverage improved, but audit integrity and policy-decision provenance remain insufficient for payment accountabi…

Recommendation

Move the audit ledger outside the execution loop into an append-only durable store with tamper evidence, such as a hash-chained event table or signed ledger. Include previous-event hash, sequence number, actor, exact policy hash used for classification, invoice snapshot hash, approval witness, transfer request hash, and response hash.
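
The hash-chained variant can be sketched as follows. The event shape and the names `append_event`, `verify_chain`, and `GENESIS` are assumptions made for this example and do not describe `RunLedger`. A chain of this kind only detects edits if the latest hash is also anchored somewhere the execution process cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first event


def _digest(seq: int, prev_hash: str, payload: dict) -> str:
    body = json.dumps({"seq": seq, "prev_hash": prev_hash, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode()).hexdigest()


def append_event(chain: list, payload: dict) -> dict:
    """Append an event whose hash covers its sequence number and predecessor."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    seq = len(chain)
    event = {"seq": seq, "prev_hash": prev_hash, "payload": payload,
             "hash": _digest(seq, prev_hash, payload)}
    chain.append(event)
    return event


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edit, reorder, or deletion breaks the chain."""
    prev_hash = GENESIS
    for index, event in enumerate(chain):
        if event["seq"] != index or event["prev_hash"] != prev_hash:
            return False
        if event["hash"] != _digest(event["seq"], event["prev_hash"], event["payload"]):
            return False
        prev_hash = event["hash"]
    return True
```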

P0

Ensure that background work remains perceptible

needs changes · production blocker · 68/100

The workflow has visible states via `RunStatus`, `TaskStatus`, and JSONL `AuditEvent`s, and `_submit_one()` records `task.submitted` before calling the bank. However, operational continuity is still process-local: `PaymentWorkflow.runs` and `PaymentWorkflow.tasks` are in-memory dictionaries, `inspect_run()` only reads those projections, and there is no replay path from `RunLedger.read_run_events()` after restart. A crash after `task.submitted` but before `task.succeeded`/`task.failed` would leave the persisted ledger inspectable only manually, not as a resumable workflow. Delta: this is a major improvement over the prior opaque loop, but persistence is not yet a usable lifecycle primitive.

Recommendation

Move run/task projections and execution cursor into a durable store or implement deterministic ledger replay before `inspect_run()` and `execute_approved()`. Treat `submitted_without_response` as an explicit recoverable state and persist cancellation flags outside the process.
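
Deterministic replay could be as small as the sketch below. The event types `task.submitted`, `task.succeeded`, and `task.failed` appear in the review; `task.approved`, `task.cancelled`, the dictionary event shape, and the `replay` function are assumptions for illustration.

```python
from __future__ import annotations

# Map ledger event types to the task status they imply.
EVENT_TO_STATUS = {
    "task.approved": "approved",      # assumed event name
    "task.submitted": "submitted",
    "task.succeeded": "succeeded",
    "task.failed": "failed",
    "task.cancelled": "cancelled",    # assumed event name
}


def replay(events: list[dict]) -> dict[str, str]:
    """Rebuild task statuses from ledger events, in ledger order."""
    status: dict[str, str] = {}
    for event in events:
        new_status = EVENT_TO_STATUS.get(event["type"])
        if new_status is not None:
            status[event["task_id"]] = new_status
    # A task last seen as submitted has no recorded bank outcome: surface it
    # as an explicit recoverable state so it is never silently re-sent.
    return {task_id: ("submitted_without_response" if s == "submitted" else s)
            for task_id, s in status.items()}
```

Running this before `inspect_run()` and `execute_approved()` after a restart would give the operator the same view the crashed process had, plus an explicit marker for transfers whose outcome is unknown.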

P0

Design for delegation rather than direct manipulation

needs changes · production blocker · 62/100

The code now models delegated work with `PaymentRun`, `PaymentTask`, and a `Policy` containing amount limits, vendor lists, due-window constraints, and approver-role intent. But the delegation contract is not yet authoritative: `draft_tasks()` accepts an arbitrary `policy` argument without checking `policy.hash() == run.policy_hash`, `Policy.max_run_total_pence` is never enforced, and `Policy.required_approver_role` is never checked. Delta: this improves the prior run/task/policy gap, but the current policy object is still partly advisory rather than a hard boundary.

Recommendation

Make the run-owned policy the single source of authority: persist the policy or signed policy envelope with the run, reject drafting/execution if the supplied policy hash differs from `run.policy_hash`, and enforce run-total, invoice, vendor, due-window, and approver-role constraints before any task can become executable.
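
The hash comparison the finding describes can be sketched as a guard that every drafting and execution path calls first. The review confirms that `Policy` has a `hash()` method and a `required_approver_role` field; the hashing scheme, the other field, `PolicyMismatch`, and `require_run_policy` are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


class PolicyMismatch(Exception):
    """The supplied policy is not the one the run was created with."""


@dataclass(frozen=True)
class Policy:
    max_run_total_pence: int
    required_approver_role: str

    def hash(self) -> str:
        # Assumed scheme: SHA-256 over a canonical JSON form of the fields.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()


def require_run_policy(run_policy_hash: str, supplied: Policy) -> Policy:
    """Return the policy only if it is byte-for-byte the run's policy."""
    if supplied.hash() != run_policy_hash:
        raise PolicyMismatch("supplied policy differs from run.policy_hash")
    return supplied
```

Storing the policy itself on the run and dropping the `policy` argument entirely would remove the need for this guard; the check is the smaller change if the signature has to stay.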

P0

Optimise for steering, not only initiating

needs changes · production blocker · 58/100

The workflow has an initial steering primitive: `cancel_run()` sets `cancellation_requested`, marks not-yet-submitted tasks `CANCELLED`, and `execute_approved()` checks the flag before each call to `transfer_funds()`. Idempotency keys also reduce duplicate-payment risk. But steering is incomplete and partly non-durable: the cancellation flag is only in memory, there is no pause/resume or reprioritisation path, failed tasks cannot actually be retried even though the inbox message says to retry from the operator inbox, and `execute_approved()` ignores `TaskStatus.FAILED` rather than offering a controlled replay with the same idempotency key. Delta: this improves the prior no-interrupt design,…

Recommendation

Persist steering state and execution cursor outside the process, then add explicit lifecycle operations for pause, resume, cancel, and retry/requeue of failed or submitted-unknown tasks using the original idempotency key. Keep the worker separate from the approval and steering surfaces so operators can intervene safely between external actions.
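
A controlled retry that reuses the original idempotency key might be sketched as below. The `Task` shape, the `retry_pending` and `submitted_without_response` status strings, and the `requeue` function are illustrative assumptions, and the steering flags are passed in as plain arguments where a real worker would read them from a durable store.

```python
from dataclasses import dataclass

# Statuses from which an operator may request another attempt (assumed names).
RETRYABLE = {"failed", "submitted_without_response"}


@dataclass
class Task:
    task_id: str
    idempotency_key: str
    status: str
    attempts: int = 1


def requeue(task: Task, *, paused: bool, cancellation_requested: bool) -> bool:
    """Mark a task for retry; return False if steering state forbids it."""
    if paused or cancellation_requested:
        return False  # operator intent wins over automatic progress
    if task.status not in RETRYABLE:
        return False  # never re-send a task with a known terminal outcome
    # The idempotency key is deliberately left untouched so the bank can
    # deduplicate if the earlier attempt did in fact go through.
    task.status = "retry_pending"
    task.attempts += 1
    return True
```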

P0

Align feedback with the user’s level of attention

needs changes · production blocker · 50/100

The code separates routine progress from intervention paths by recording normal transitions in `_record()` and escalating `BankAPIError` plus policy-hash mismatch through `OperatorInbox.post()`. But `OperatorInbox` is only an in-memory `list`, and `run_daily_batch()` creates it locally and discards it, so a critical `auth_missing` alert may never reach an operator in a cron/background context. Delta: this addresses the prior swallowed-exception pattern, but the escalation channel is still not durable or operator-facing.

Recommendation

Move `OperatorInbox` behind a durable operator-facing notification surface, such as a database-backed inbox, dashboard, or incident channel. Return or persist alert IDs from `run_daily_batch()` so critical failures cannot disappear with the process.
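
As a rough illustration of a database-backed inbox, the sketch below uses SQLite from the standard library. The table layout and the names `open_inbox`, `post_alert`, and `unacknowledged` are assumptions, and a real deployment would also need a delivery channel that actually reaches an operator.

```python
import sqlite3


def open_inbox(path: str) -> sqlite3.Connection:
    """Open (or create) the alert store at the given path."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alerts ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " run_id TEXT NOT NULL,"
        " severity TEXT NOT NULL,"
        " message TEXT NOT NULL,"
        " acknowledged INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def post_alert(conn: sqlite3.Connection, run_id: str, severity: str, message: str) -> int:
    """Persist an alert and return its ID so the caller can surface it."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO alerts (run_id, severity, message) VALUES (?, ?, ?)",
            (run_id, severity, message),
        )
    return cursor.lastrowid


def unacknowledged(conn: sqlite3.Connection, run_id: str) -> list:
    """List alerts for a run that no operator has acknowledged yet."""
    return conn.execute(
        "SELECT id, severity, message FROM alerts"
        " WHERE run_id = ? AND acknowledged = 0 ORDER BY id",
        (run_id,),
    ).fetchall()
```

If `run_daily_batch()` returned the IDs from `post_alert`, a cron wrapper could exit non-zero whenever a critical alert was raised, so the failure remains visible after the process ends.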

P0

Apply progressive disclosure to system agency

aligned

`inspect_run()` provides a concise top-level run view with `run_id`, `status`, `policy_hash`, cancellation fields, and a `summary` grouped by task status, then exposes per-task drill-down including `policy_decision`, invoice snapshot, approval identity/time, transfer ID, and failure details. Delta: this improves the prior lack of progressive inspection with a clear summary-plus-detail structure.

P0

Expose meaningful operational state, not internal complexity

aligned

The code exposes user-relevant lifecycle states through `RunStatus` values such as `awaiting_approval`, `executing`, `partially_completed`, and `failed`, plus `TaskStatus` values such as `awaiting_approval`, `approved`, `submitted`, `succeeded`, `failed`, `skipped`, and `cancelled`. Technical details are kept in diagnostic fields like `failure_class`, `failure_message`, and `bank_response` rather than replacing the operational state. Delta: this addresses the prior single-shot loop by making payment progress understandable at the run and task levels.

P0

Represent delegated work as a system, not merely as a conversation

aligned

The workflow is represented as a structured system rather than a conversation: `PaymentRun` owns child `PaymentTask`s, `RunStatus` and `TaskStatus` encode lifecycle, `_record()` emits timeline events, `execute_approved()` derives final run outcome from task states, and `inspect_run()` shows the run/task hierarchy. Delta: this resolves the prior lack of a system representation, although separate durability concerns are covered under P2 and P7.

Embed in your README

Two embeddable variants: a small flat shield and a richer score card.

Score card (recommended)

[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/b8d61c00-2b86-45f0-a533-526202371592/card.svg)](https://aidesignblueprint.com/en/readiness-review/b8d61c00-2b86-45f0-a533-526202371592)

Flat badge

[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/b8d61c00-2b86-45f0-a533-526202371592.svg)](https://aidesignblueprint.com/en/readiness-review/b8d61c00-2b86-45f0-a533-526202371592)

Baseline and iteration details

Baseline: used · Doctrine: same doctrine · Race: checked clear

Iteration delta

Improvements (9)

| Principle | Title | Prior run | This run |
| --- | --- | --- | --- |
| P1 | Design for delegation rather than direct manipulation | needs_changes | needs_changes |
| P2 | Ensure that background work remains perceptible | high_risk | needs_changes |
| P3 | Align feedback with the user’s level of attention | high_risk | needs_changes |
| P4 | Apply progressive disclosure to system agency | needs_changes | aligned |
| P6 | Expose meaningful operational state, not internal complexity | high_risk | aligned |
| P7 | Establish trust through inspectability | high_risk | needs_changes |
| P8 | Make hand-offs, approvals, and blockers explicit | high_risk | high_risk |
| P9 | Represent delegated work as a system, not merely as a conversation | needs_changes | aligned |
| P10 | Optimise for steering, not only initiating | high_risk | needs_changes |

Rubric: 2026-05-04

Run ID: b8d61c00-2b86-45f0-a533-526202371592 · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.