Reviewed

Assessment complete; awaiting evidence review.

Agent Architecture Review, Validation snapshot

Evaluated on 10 May 2026 against the AI Design Blueprint doctrine

Status: High risk

Score: 40/100

Grade: D

4 aligned · 6 production blockers · 3 high risk

Blueprint Readiness measures alignment with the doctrine, not runtime correctness. A production-ready verdict means the architecture embodies the 10 principles; it does not run your tests or check your types. Use it alongside your test suite, not in place of it.

Per-principle verdicts

The workflow is correctly classified as an autonomous payment workflow and contains strong primitives: run/task state, HMAC approval envelopes, a hash-chained ledger, durable inbox alerts, pause/cancel/retry/reconcile paths, and an explicit mock mode that no longer auto-marks mock submissions as succeeded. However, production trust still fails around the bank handoff and status model: the live bank path raises an uncaught NotImplementedError after the task has already been recorded as submitted, any future non-mock bank response is treated as SUCCEEDED without typed confirmation, and mock simulations are ultimately exposed as the same task status as real success. Those are production blockers for a delegated finance workflow.

Iteration history

5 previous runs on this artifact. Each run_id opens its own readiness review.

Scores can rise or fall between iterations: the validator's reasoning is not strictly deterministic, so the same artifact can receive different scores from one run to the next. The per-principle deltas below show the substantive change.

When	Score	Tier	Run ID
10 May 2026 (this run)	40 / D	Draft	b1195c34…
10 May 2026	74 / C	In development	dd3a9348…
10 May 2026	74 / C	In development	b4799966…
10 May 2026	74 / C	In development	7e9bc0f6…
10 May 2026	60 / C	In development	15aa9649…
10 May 2026	30 / F	Draft	b8d61c00…

Findings by principle

10 principles evaluated. Verdict, severity, evidence, and recommendation for each.

High risk · Production blocker · 92/100

Make hand-offs, approvals, and blockers explicit

Approval handoffs are explicit through `ApprovalEnvelope`, HMAC verification, role checks, and `AWAITING_APPROVAL`, and mock responses are no longer auto-promoted. The external bank handoff is still unsafe: `_submit_one()` records `task.submitted` before the bank call has succeeded; `_post_bank_transfer_live()` is deliberately unwired and raises `NotImplementedError` that is not converted into a workflow blocker; and any future non-mock response is treated as `SUCCEEDED` without typed confirmation. This can leave operators with a false or ambiguous payment state at the exact externally visible action boundary.

Recommendation

Architecturally separate bank authority into a real bank-transfer service/client with a typed result envelope. Only record `SUBMITTED` after bank acceptance, only record `SUCCEEDED` after explicit bank-confirmed success, and convert adapter-unavailable or ambiguous responses into `SUBMITTED_UNKNOWN`/`BLOCKED` with reconciliation instructions.
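A minimal sketch of the typed result envelope this recommendation describes, assuming the artifact is Python (identifiers such as `NotImplementedError` suggest so). `BankOutcome`, `TransferResult`, and `status_for()` are hypothetical names, and `TaskStatus` is a simplified stand-in for the artifact's enum. The point is that task status is derived only from a typed outcome, never from the mere fact that a call returned.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskStatus(Enum):
    # Simplified stand-in for the artifact's TaskStatus; BLOCKED comes from the recommendation.
    SUBMITTED = "submitted"
    SUBMITTED_UNKNOWN = "submitted_unknown"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    BLOCKED = "blocked"


class BankOutcome(Enum):
    ACCEPTED = "accepted"        # bank took the request, settlement still pending
    CONFIRMED = "confirmed"      # bank explicitly confirmed success
    REJECTED = "rejected"        # bank refused the request
    UNKNOWN = "unknown"          # no reliable answer (timeout, ambiguous payload)
    UNAVAILABLE = "unavailable"  # adapter not wired; the request never left the process


@dataclass(frozen=True)
class TransferResult:
    outcome: BankOutcome
    transfer_id: Optional[str] = None
    detail: str = ""


def status_for(result: TransferResult) -> TaskStatus:
    """Map a typed bank result to a task status; nothing is inferred from raw dicts."""
    if result.outcome is BankOutcome.CONFIRMED and result.transfer_id:
        return TaskStatus.SUCCEEDED
    if result.outcome is BankOutcome.ACCEPTED:
        return TaskStatus.SUBMITTED
    if result.outcome is BankOutcome.REJECTED:
        return TaskStatus.FAILED
    if result.outcome is BankOutcome.UNAVAILABLE:
        return TaskStatus.BLOCKED
    # UNKNOWN, or CONFIRMED without a transfer_id: the bank may have seen it, so reconcile.
    return TaskStatus.SUBMITTED_UNKNOWN
```

Mapping `UNAVAILABLE` to `BLOCKED` and every ambiguous case to `SUBMITTED_UNKNOWN` keeps two different failure meanings apart: one guarantees nothing was sent, the other requires reconciliation.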

High risk · Production blocker · 88/100

Replace implied magic with clear mental models

The explicit `BANK_MODE` primitive and `mock=True` tagging improve the mental model, but two status semantics still imply more certainty than the system has. First, `_submit_one()` marks any non-mock response as `TaskStatus.SUCCEEDED` without checking a typed success/settlement field, response status, or transfer acceptance contract. Second, `confirm_mock_simulation()` promotes a mock task to `TaskStatus.SUCCEEDED`, while `inspect_run()` exposes only `status: "succeeded"`, `transfer_id`, and no `simulated`/`mock` flag. A downstream reviewer can confuse a simulated success or an unvalidated bank response with real payment completion.

Recommendation

Use distinct operational semantics: add `SIMULATED_SUCCEEDED` or a required `simulated: bool` task field exposed in `inspect_run()`, and require a typed bank response such as `TransferConfirmed` before assigning real `SUCCEEDED`. Do not let an arbitrary non-mock dict imply payment success.
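One way to make the distinction structural, sketched under assumptions: `TransferConfirmed` and its `settled` field are hypothetical (not a real bank API), `PaymentTask` and `confirm_mock_simulation()` are simplified re-implementations rather than the artifact's code, and `inspect_task()` stands in for the per-task part of `inspect_run()`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class TaskStatus(Enum):
    SUBMITTED = "submitted"
    SUBMITTED_UNKNOWN = "submitted_unknown"
    SUCCEEDED = "succeeded"
    SIMULATED_SUCCEEDED = "simulated_succeeded"


@dataclass(frozen=True)
class TransferConfirmed:
    # Hypothetical typed confirmation; these fields are assumptions, not a real bank API.
    transfer_id: str
    settled: bool


@dataclass
class PaymentTask:
    task_id: str
    simulated: bool
    status: TaskStatus = TaskStatus.SUBMITTED
    transfer_id: Optional[str] = None


def record_bank_response(task: PaymentTask, response: Any) -> None:
    """Promote to SUCCEEDED only for a typed, settled confirmation on a real task."""
    if task.simulated:
        raise ValueError("simulated tasks cannot receive real bank confirmations")
    if isinstance(response, TransferConfirmed) and response.settled:
        task.status = TaskStatus.SUCCEEDED
        task.transfer_id = response.transfer_id
    else:
        # A bare dict (or an unsettled confirmation) proves nothing about payment.
        task.status = TaskStatus.SUBMITTED_UNKNOWN


def confirm_mock_simulation(task: PaymentTask) -> None:
    # Simplified version: a confirmed simulation gets its own status, never SUCCEEDED.
    if not task.simulated:
        raise ValueError("only simulated tasks can be confirmed as simulations")
    task.status = TaskStatus.SIMULATED_SUCCEEDED


def inspect_task(task: PaymentTask) -> dict:
    # The simulated flag sits next to status so downstream reviewers cannot miss it.
    return {"task_id": task.task_id, "status": task.status.value,
            "simulated": task.simulated, "transfer_id": task.transfer_id}
```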

High risk · Production blocker · 85/100

Ensure that background work remains perceptible

Most background work is perceptible through `RunStatus`, `TaskStatus`, `RunLedger`, `DurableOperatorInbox`, and `inspect_run()`. But the live bank path breaks perceptibility: `_submit_one()` sets `task.status = TaskStatus.SUBMITTED` and appends `event_type="task.submitted"` before calling `transfer_funds()`, while `_post_bank_transfer_live()` raises `NotImplementedError` and `_submit_one()` catches only `BankAPIError`. A live-mode run can therefore crash with the run left `EXECUTING` and the task recorded as `SUBMITTED` without a durable failure alert or blocker. Delta: despite the prior aligned baseline, this current submission exposes a concrete unhandled live-adapter path.

Recommendation

Move the bank handoff behind a typed adapter/service boundary that returns explicit `accepted`, `confirmed`, `rejected`, or `unknown` results; preflight that the live adapter is wired before recording `task.submitted`; and catch all adapter exceptions into a durable `task.submission_blocked` or `task.submitted_unknown` ledger event plus critical inbox alert.
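A sketch of that boundary, assuming in-memory stand-ins for `RunLedger` and `DurableOperatorInbox` (the hypothetical `Ledger` and `Inbox` below are not durable). `submit_one()` preflights the adapter, records only an attempt before the call, catches every exception, and writes `task.submitted` only after the adapter returns a transfer ID.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Ledger:
    # In-memory stand-in for an append-only run ledger.
    events: List[dict] = field(default_factory=list)

    def append(self, event_type: str, **data) -> None:
        self.events.append({"event_type": event_type, **data})


@dataclass
class Inbox:
    # In-memory stand-in for a durable operator inbox.
    alerts: List[dict] = field(default_factory=list)

    def post(self, severity: str, message: str) -> None:
        self.alerts.append({"severity": severity, "message": message})


def submit_one(task_id: str, transfer: Optional[Callable[[str], str]],
               ledger: Ledger, inbox: Inbox) -> str:
    # Preflight: an unwired live adapter is a blocker, and nothing has been sent.
    if transfer is None:
        ledger.append("task.submission_blocked", task_id=task_id, reason="live adapter not wired")
        inbox.post("critical", f"{task_id}: live bank client not wired; no request was sent")
        return "blocked"
    ledger.append("task.submission_attempted", task_id=task_id)
    try:
        transfer_id = transfer(task_id)
    except Exception as exc:  # deliberately broad: we cannot prove the bank did not see it
        ledger.append("task.submitted_unknown", task_id=task_id, error=repr(exc))
        inbox.post("critical", f"{task_id}: bank call raised {exc!r}; reconcile before retrying")
        return "submitted_unknown"
    # Only now is it safe to record the submission.
    ledger.append("task.submitted", task_id=task_id, transfer_id=transfer_id)
    return "submitted"
```

Catching `Exception` broadly is deliberate at a payment boundary: an unexpected error, including a `NotImplementedError` from an unwired live path, should become a visible `task.submitted_unknown` state rather than an uncaught crash.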

Needs changes · Production blocker · 70/100

Align feedback with the user’s level of attention

The code does escalate many attention-worthy states: `inbox.post()` is used for forged approvals, expired envelopes, insufficient role, bank `auth_missing`, policy failures, unfinished tasks, and mock submissions. However, the highest-risk live-adapter failure is not converted into an operator-facing alert because `NotImplementedError` from `_post_bank_transfer_live()` is uncaught after `task.submitted` is written. The operator is not given a calibrated foreground/background signal that the bank client is unavailable and no real submission occurred.

Recommendation

Handle live-adapter-unavailable and unexpected bank-client exceptions as first-class workflow states with a critical inbox message that says exactly what happened, whether the bank may have seen the request, and the next safe action: wire live client, reconcile, retry, or cancel.
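A sketch of a structured alert whose fields answer the three questions the recommendation lists: what happened, whether the bank may have seen the request, and the next safe action. `CriticalAlert` and both factory functions are hypothetical; the rendered string would be handed to something like the artifact's `inbox.post()`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CriticalAlert:
    # Hypothetical alert schema; each field answers one operator question.
    task_id: str
    what_happened: str
    bank_may_have_seen_request: bool
    next_safe_actions: Tuple[str, ...]

    def render(self) -> str:
        seen = ("the bank may have received this request; reconcile before retrying"
                if self.bank_may_have_seen_request
                else "no request reached the bank")
        actions = ", ".join(self.next_safe_actions)
        return (f"[CRITICAL] {self.task_id}: {self.what_happened}. "
                f"Bank visibility: {seen}. Next safe actions: {actions}.")


def adapter_unavailable_alert(task_id: str) -> CriticalAlert:
    # The adapter was never wired, so the request cannot have left the process.
    return CriticalAlert(task_id, "live bank client is not wired", False,
                         ("wire live client", "cancel"))


def unexpected_error_alert(task_id: str, error: str) -> CriticalAlert:
    # An unexpected client error is treated conservatively as possibly seen by the bank.
    return CriticalAlert(task_id, f"bank client raised {error}", True,
                         ("reconcile", "retry after reconciliation", "cancel"))
```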

Needs changes · Production blocker · 70/100

Establish trust through inspectability

There is a strong inspectability base: `Invoice.snapshot_hash()`, `Policy.hash()`, HMAC-bound `ApprovalEnvelope`, append-only `RunLedger`, `event_hash`/`prev_event_hash`, `verify_chain()`, and `inspect_run()` all support traceability. The audit chain is incomplete on important failure paths: `_post_bank_transfer_live()` can raise uncaught `NotImplementedError` after `task.submitted`, leaving no ledgered failure/blocker event; expired approval envelopes post an inbox alert but do not append a rejection event, and insufficient-role rejections also lack a ledger append. For a payment workflow, rejected approvals and bank adapter failures must be hash-chain visible, not only thrown or posted out…

Recommendation

Make every approval rejection and every bank-adapter exception append a typed ledger event before returning/raising. For production, separate the audit ledger from the execution process or anchor the hash-chain tail externally so the ledger is not merely a mutable local JSONL file.
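A minimal in-memory sketch of the pattern, assuming SHA-256 over canonical JSON. `HashChainLedger`, `tail_hash()`, and `reject_approval()` are hypothetical; they reuse only the `event_hash`/`prev_event_hash`/`verify_chain()` vocabulary from the finding and are not the artifact's `RunLedger`. The rejection is appended before the exception is raised, and `tail_hash()` is the value one would publish outside the process to anchor the chain.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import List, NoReturn

GENESIS = "0" * 64


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same event always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class HashChainLedger:
    """Append-only, in-memory sketch of a hash-chained ledger."""
    events: List[dict] = field(default_factory=list)

    def append(self, event_type: str, payload: dict) -> dict:
        prev = self.events[-1]["event_hash"] if self.events else GENESIS
        body = {"event_type": event_type, "payload": payload, "prev_event_hash": prev}
        event = {**body, "event_hash": _digest(body)}
        self.events.append(event)
        return event

    def verify_chain(self) -> bool:
        prev = GENESIS
        for event in self.events:
            body = {k: event[k] for k in ("event_type", "payload", "prev_event_hash")}
            if event["prev_event_hash"] != prev or event["event_hash"] != _digest(body):
                return False
            prev = event["event_hash"]
        return True

    def tail_hash(self) -> str:
        # Publishing this outside the process anchors everything before it.
        return self.events[-1]["event_hash"] if self.events else GENESIS


def reject_approval(ledger: HashChainLedger, task_id: str, reason: str) -> NoReturn:
    """Ledger the rejection before raising, so it is hash-chain visible."""
    ledger.append("approval.rejected", {"task_id": task_id, "reason": reason})
    raise PermissionError(f"approval rejected for {task_id}: {reason}")
```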

Needs changes · Production blocker · 65/100

Expose meaningful operational state, not internal complexity

The enum model is generally user-relevant (`AWAITING_APPROVAL`, `APPROVED`, `SUBMITTED_UNKNOWN`, `SUCCEEDED`, `FAILED`, `PAUSED`, `CANCELLED`) and `inspect_run()` summarizes tasks by status. The operational state becomes misleading at the bank boundary: `task.submitted` is recorded before the adapter successfully accepts the submission, and mock-confirmed tasks are shown as ordinary `succeeded` tasks in `inspect_run()`. Those states do not reliably tell the operator whether money moved, a request was merely attempted, or a simulation was acknowledged.

Recommendation

Split submission and completion states into user-meaningful phases such as `SUBMISSION_ATTEMPTING`, `SUBMITTED_ACCEPTED`, `SUBMITTED_UNKNOWN`, `CONFIRMED_PAID`, and `SIMULATED_SUCCEEDED`, and expose `bank_mode`/`simulated` in the primary inspection output.
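A sketch of the split phases as an explicit transition table, assuming `BANK_MODE` takes the values `"mock"` and `"live"`. The phase names come from the recommendation; `Phase`, `ALLOWED`, `transition()`, and `inspect_task()` are hypothetical, and the specific set of allowed transitions is an assumption.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Phase(Enum):
    APPROVED = "approved"
    SUBMISSION_ATTEMPTING = "submission_attempting"
    SUBMITTED_ACCEPTED = "submitted_accepted"
    SUBMITTED_UNKNOWN = "submitted_unknown"
    CONFIRMED_PAID = "confirmed_paid"
    SIMULATED_SUCCEEDED = "simulated_succeeded"
    FAILED = "failed"


# Assumed transition table; terminal phases allow no further moves.
ALLOWED: Dict[Phase, FrozenSet[Phase]] = {
    Phase.APPROVED: frozenset({Phase.SUBMISSION_ATTEMPTING}),
    Phase.SUBMISSION_ATTEMPTING: frozenset({Phase.SUBMITTED_ACCEPTED, Phase.SUBMITTED_UNKNOWN,
                                            Phase.SIMULATED_SUCCEEDED, Phase.FAILED}),
    Phase.SUBMITTED_ACCEPTED: frozenset({Phase.CONFIRMED_PAID, Phase.SUBMITTED_UNKNOWN,
                                         Phase.FAILED}),
    Phase.SUBMITTED_UNKNOWN: frozenset({Phase.CONFIRMED_PAID, Phase.FAILED}),  # via reconciliation
    Phase.CONFIRMED_PAID: frozenset(),
    Phase.SIMULATED_SUCCEEDED: frozenset(),
    Phase.FAILED: frozenset(),
}


def transition(current: Phase, target: Phase, *, bank_mode: str) -> Phase:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    # Mode guards: a simulation can never look paid, and a live run can never look simulated.
    if bank_mode == "mock" and target is Phase.CONFIRMED_PAID:
        raise ValueError("mock runs cannot reach confirmed_paid")
    if bank_mode != "mock" and target is Phase.SIMULATED_SUCCEEDED:
        raise ValueError("live runs cannot reach simulated_succeeded")
    return target


def inspect_task(task_id: str, phase: Phase, bank_mode: str) -> dict:
    # bank_mode and simulated are part of the primary view, not buried in the ledger.
    return {"task_id": task_id, "phase": phase.value,
            "bank_mode": bank_mode, "simulated": bank_mode == "mock"}
```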

Aligned

Design for delegation rather than direct manipulation

The workflow is designed around delegated intent and constraints rather than direct manual payment execution: `Policy` captures amount caps, vendor allow/block lists, due windows, and required approver role; `create_run()` establishes a delegated `run_id`; `draft_tasks()` turns invoices into governed `PaymentTask`s; `approve()` uses a signed `ApprovalEnvelope`; and `execute_approved_payments()` executes only approved work under policy checks. This maintains the prior aligned delegation structure.
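To illustrate how a delegated policy gates execution, here is a sketch of the four constraints the finding lists. The field names, the cents-based amounts, and the due-window rule (due date between today and today plus `due_window_days`) are assumptions, not the artifact's actual `Policy`.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import FrozenSet, List


@dataclass(frozen=True)
class Policy:
    # Illustrative fields for the constraints named in the finding.
    max_amount_cents: int
    allowed_vendors: FrozenSet[str]
    blocked_vendors: FrozenSet[str]
    due_window_days: int
    required_role: str


@dataclass(frozen=True)
class PaymentTask:
    vendor: str
    amount_cents: int
    due: date


def policy_violations(policy: Policy, task: PaymentTask,
                      approver_role: str, today: date) -> List[str]:
    """Return every violated constraint; an empty list means the task may execute."""
    problems = []
    if task.amount_cents > policy.max_amount_cents:
        problems.append("amount exceeds cap")
    if task.vendor in policy.blocked_vendors:
        problems.append("vendor is blocked")
    elif task.vendor not in policy.allowed_vendors:
        problems.append("vendor not on allow list")
    if not today <= task.due <= today + timedelta(days=policy.due_window_days):
        problems.append("due date outside window")
    if approver_role != policy.required_role:
        problems.append("approver lacks required role")
    return problems
```

Returning all violations, rather than stopping at the first, gives the operator the full reason a delegated task was held back.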

Aligned

Apply progressive disclosure to system agency

The code uses reasonable progressive disclosure: `inspect_run()` provides a primary operational view with run status, policy hash, audit-chain integrity, critical alerts, summary counts, and per-task status; deeper details remain available through `RunLedger.read_run_events()` and `verify_chain()`. The primary view emphasizes outcome and required attention rather than dumping the raw JSONL ledger by default.
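A simplified sketch of the two disclosure layers, assuming plain dicts for task status, alerts, and ledger lines. It reuses the names `inspect_run()` and `read_run_events()` for readability, but both signatures are hypothetical and do not reproduce the artifact's.

```python
from collections import Counter
from typing import Dict, List


def inspect_run(run_id: str, run_status: str, task_status: Dict[str, str],
                alerts: List[dict], chain_ok: bool, policy_hash: str) -> dict:
    """Primary view: outcome and required attention first, no raw ledger lines."""
    critical = [a for a in alerts if a.get("severity") == "critical"]
    return {
        "run_id": run_id,
        "run_status": run_status,
        "policy_hash": policy_hash,
        "audit_chain_ok": chain_ok,
        "needs_attention": bool(critical) or not chain_ok,
        "critical_alerts": [a["message"] for a in critical],
        "summary": dict(Counter(task_status.values())),
        "tasks": task_status,
    }


def read_run_events(ledger_lines: List[dict], run_id: str) -> List[dict]:
    # Second layer: the raw ledger stays available, but only on explicit request.
    return [e for e in ledger_lines if e.get("run_id") == run_id]
```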

Aligned

Represent delegated work as a system, not merely as a conversation

Delegated work is represented as a structured system rather than a conversation: `PaymentRun` and `PaymentTask` model lifecycle state, `RunStatus`/`TaskStatus` encode progress, `RunLedger` provides a timeline, `DurableOperatorInbox` captures intervention-required alerts, and `inspect_run()` returns a task/status view with summary counts and reconciliation needs. This maintains the prior aligned system representation.

P10

Aligned

Optimise for steering, not only initiating

The workflow includes steering primitives beyond initiation: `pause_run()`, `resume_run()`, `cancel_run()`, `resume_after_failure()`, `retry_failed_task()`, `reconcile_submitted_task()`, and `confirm_mock_simulation()` allow operators to interrupt, resume, retry, reconcile, or explicitly acknowledge mock submissions. `execute_approved_payments()` checks `ledger.latest_steering_intent(run_id)` between tasks, so ledgered pause/cancel intent can steer an in-progress run at task boundaries.
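A sketch of task-boundary steering, with a hypothetical in-memory `SteeringLedger` standing in for the artifact's ledger and its `latest_steering_intent(run_id)` method; `execute_approved_payments()` here is a simplified re-implementation. Intent is re-read before each task, so a pause or cancel recorded mid-run takes effect at the next boundary without interrupting a payment in flight.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class SteeringLedger:
    """In-memory stand-in that keeps only the latest steering intent per run."""
    intents: Dict[str, str] = field(default_factory=dict)

    def record(self, run_id: str, intent: str) -> None:
        self.intents[run_id] = intent

    def latest_steering_intent(self, run_id: str) -> Optional[str]:
        return self.intents.get(run_id)


def execute_approved_payments(run_id: str, tasks: List[str], ledger: SteeringLedger,
                              pay: Callable[[str], None]) -> str:
    """Check steering intent at each task boundary; never interrupt a task mid-flight."""
    for task_id in tasks:
        intent = ledger.latest_steering_intent(run_id)
        if intent == "cancel":
            return "cancelled"
        if intent == "pause":
            return "paused"
        pay(task_id)
    return "completed"
```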

Add to your README

Two embeddable variants: a compact badge and a richer card.

Score card (recommended)

[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/b1195c34-8d8a-495e-a91a-d30ed551ecc3/card.svg)](https://aidesignblueprint.com/en/readiness-review/b1195c34-8d8a-495e-a91a-d30ed551ecc3)

Flat badge

[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/b1195c34-8d8a-495e-a91a-d30ed551ecc3.svg)](https://aidesignblueprint.com/en/readiness-review/b1195c34-8d8a-495e-a91a-d30ed551ecc3)

Baseline and iteration details

Baseline: used · Doctrine: same doctrine · Race: checked clear

Iteration delta

0 closed in this iteration · 6 reopened · 3 high-risk findings still open

Regressions (6)

P2 · Ensure that background work remains perceptible · aligned → high_risk

P3 · Align feedback with the user’s level of attention · aligned → needs_changes

P5 · Replace implied magic with clear mental models · aligned → high_risk

P6 · Expose meaningful operational state, not internal complexity · aligned → needs_changes

P7 · Establish trust through inspectability · aligned → needs_changes

P8 · Make hand-offs, approvals, and blockers explicit · aligned → high_risk

Rubric: 2026-05-04


Run ID: b1195c34-8d8a-495e-a91a-d30ed551ecc3 · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.