Production-ready case study

From silent transfer fabrication to cert-confirmed in ten iterations

A 156-line Python script drafted bank transfers, swallowed errors, and could mark transfers SUCCEEDED without ever talking to a real bank. Fixing it took ten iterations of architect.validate and four architect.certify second-pass reviews on the production MCP. Validate is the first-pass, principle-by-principle review; certify is a second-pass adversarial reviewer that runs only after validate aligns and looks for production blockers the principle pass couldn't see. The review surfaced a different load-bearing failure mode on each pass: false-COMPLETE via still-AWAITING tasks, false-COMPLETE despite stale FAILED tasks, a stubbed bank recorded as success in the audit ledger, a money-at-risk projection error, and a blocker that was not durable across passes. Iter 10 is the first run where validate AND cert both pass cleanly.

Key Facts

Validator iterations: 10 prod-MCP runs
Cert calls: 4 second-pass reviews
Score trajectory: 0/F → 100/A
Cert outcome: confirmed_production_ready
Doctrine compliance: 10 / 10 principles aligned

Live validation

Two live readiness badges, before and after

Both runs are real validator outputs, not demos. The before badge links to the Iter1 baseline (0/F/draft, full ungoverned script); the after badge links to the Iter10 cert-confirmed run (100/A/production_ready). Every iteration in between is linked in the trajectory below.

Blueprint Readiness Score card, Iter10 cert-confirmed

After, Iter10 cert-confirmed

Validator + cert trajectory

Ten iterations, every run ID public

Each iteration is a real prod-MCP architect.validate call. Iter 4, 5, and 6 each went on to an architect.certify second-pass review, and each review downgraded the score from 100/A to 74/C/emerging because the cert reviewer caught a different production-blocking failure mode on each pass. Iter 7-9 then regressed on validate as the cert-layer fixes unmasked deeper gaps. Iter 10 is the first run where validate AND cert both pass cleanly: the cert reviewer found zero missed blockers.

| Iter | Change | Score | Validate | Cert | Run ID |
| --- | --- | --- | --- | --- | --- |
| Iter1 | Original 156-line script, no audit, no run_id, no approval gate | 0 / F | HIGH_RISK | | ac64d7d9-ce25-4d63-8537-4d866d78b8f1 |
| Iter2 | Add tz-aware datetime + run_id surface + per-task state enum | 30 / F | HIGH_RISK | | b8d61c00-2b86-45f0-a533-526202371592 |
| Iter3 | Hash-chained audit ledger + signed approval envelope | 60 / C | HIGH_RISK | | 15aa9649-84d8-4a75-aec7-fd101a5b0535 |
| Iter4 | Reject approvals on terminal runs (cert downgrade: false-COMPLETE via still-AWAITING tasks) | 74 / C | ALIGNED | 74 / C | 7e9bc0f6-8fd3-4586-851f-d26b720ac767 |
| Iter5 | Run-status from whole task projection (cert downgrade: false-COMPLETE despite stale FAILED tasks) | 74 / C | ALIGNED | 74 / C | b4799966-efdb-4f2a-8f07-44811ecb7ff1 |
| Iter6 | Explicit BANK_MODE env + mock=True tags (cert downgrade: stubbed bank → SUCCEEDED in audit) | 74 / C | ALIGNED | 74 / C | dd3a9348-7c1b-488e-930a-7f77d433aa6c |
| Iter7 | Broad except + typed `confirmed=True` contract (validate regression: iter-6 fix unmasked deeper gaps) | 40 / D | HIGH_RISK | | b1195c34-8d8a-495e-a91a-d30ed551ecc3 |
| Iter8 | SIMULATED_SUCCEEDED status + submission_blocked event (validate regression: SUBMITTED_UNKNOWN doesn't consume run-budget) | 30 / F | HIGH_RISK | | 5c1d5833-5f29-462f-b1a8-e774498c40fb |
| Iter9 | Exposure-from-projection budget guard + reconcile broad-except (regression: blocker not durable across passes) | 60 / C | HIGH_RISK | | 039064f9-c4e9-45de-8c16-1c160a01fca1 |
| Iter10 | Durable preflight + BLOCKED_ON_RECONCILIATION + CANCELLATION_PENDING_RECONCILIATION states. Cert second-pass: confirmed. | 100 / A | ALIGNED | Confirmed | 29b080f4-b9a3-439d-bb07-cf666a18300d |

Cert mandate value

Each cert call caught what the prior fix unmasked

First-pass validate alone returned aligned three times in a row (Iter 4-6). Three times in a row the second-pass cert reviewer surfaced a production-blocking failure mode the first-pass review couldn't see. Each fix to the cert finding triggered a NEW failure mode that only became reachable because the prior fix existed. This is the load-bearing argument for layered second-pass certification.

False-COMPLETE via still-AWAITING tasks (P8 sev 88)

execute_approved_payments() iterated through APPROVED tasks, skipped every non-APPROVED task without inspecting its status, then unconditionally set RunStatus.COMPLETED. A run could therefore be marked COMPLETED while tasks still in AWAITING_APPROVAL remained unpaid.

Fix: Count actionable tasks before terminal transition; hold in AWAITING_APPROVAL if any remain. Reject approvals on terminal runs.

False-COMPLETE despite stale FAILED tasks (P8 sev 9)

The any_failed flag was reset on each call, so FAILED tasks from prior calls were ignored. A second execute pass with one new SUCCEEDED task could mark the run COMPLETED while older FAILED tasks remained unpaid.

Fix: Derive terminal status from whole-task projection (failed_total, succeeded_total, actionable_total). Never from a per-pass flag.
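
The two fixes above (count actionable tasks before any terminal transition, and derive status from the whole-task projection) can be sketched as a single pure function. This is a minimal illustration, not the package source: derive_run_status, the ACTIONABLE set, and the COMPLETED_WITH_FAILURES state name are assumptions introduced here.

```python
from collections import Counter
from enum import Enum


class TaskStatus(Enum):
    AWAITING_APPROVAL = "awaiting_approval"
    APPROVED = "approved"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class RunStatus(Enum):
    AWAITING_APPROVAL = "awaiting_approval"
    COMPLETED = "completed"
    COMPLETED_WITH_FAILURES = "completed_with_failures"  # hypothetical name


# Statuses that still need an operator or executor action (assumed set).
ACTIONABLE = {TaskStatus.AWAITING_APPROVAL, TaskStatus.APPROVED}


def derive_run_status(task_statuses):
    """Compute run status from every task, never from a per-pass flag."""
    counts = Counter(task_statuses)
    actionable_total = sum(counts[s] for s in ACTIONABLE)
    failed_total = counts[TaskStatus.FAILED]
    if actionable_total:
        # Work remains, so the run must not reach a terminal state.
        return RunStatus.AWAITING_APPROVAL
    if failed_total:
        # A FAILED task from any earlier pass still counts.
        return RunStatus.COMPLETED_WITH_FAILURES
    return RunStatus.COMPLETED
```

Because the function only reads the full list of task statuses, a second execute pass cannot forget a FAILED task from the first pass, and a lingering AWAITING_APPROVAL task always blocks completion.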

Stubbed bank execution recorded as SUCCEEDED (P5 sev 10)

transfer_funds() returned a fabricated success dict without calling any real bank. The audit ledger recorded SUCCEEDED with a mock transfer_id; an auditor reading the ledger could not tell mock from real.

Fix: Explicit BANK_MODE env (live|mock), fail-closed on ambiguous; mock responses tagged mock=True; refuse mock→SUCCEEDED auto-promotion.
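
A fail-closed mode switch of this kind might look like the sketch below. Only the BANK_MODE variable, its live|mock values, the mock=True tag, and the SIMULATED_SUCCEEDED status come from the case study; resolve_bank_mode, tag_response, status_for, and BankModeError are hypothetical names used for illustration.

```python
import os


class BankModeError(RuntimeError):
    """Raised when BANK_MODE is missing or not exactly 'live' or 'mock'."""


def resolve_bank_mode(env=None):
    """Read BANK_MODE and refuse to guess when it is absent or ambiguous."""
    env = os.environ if env is None else env
    mode = env.get("BANK_MODE")
    if mode not in ("live", "mock"):
        # Fail closed: no default, no case folding, no fallback to mock.
        raise BankModeError(f"BANK_MODE must be 'live' or 'mock', got {mode!r}")
    return mode


def tag_response(mode, response):
    """Return a copy of the bank response carrying an explicit mock flag."""
    return {**response, "mock": mode == "mock"}


def status_for(response):
    """Map a tagged response to a task status; mock never becomes SUCCEEDED."""
    return "SIMULATED_SUCCEEDED" if response["mock"] else "SUCCEEDED"
```

The design choice worth noting is the absence of a default: an unset or misspelled variable stops the run instead of silently selecting one mode.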

✅ confirmed_production_ready · zero missed blockers

The second-pass adversarial reviewer found no production_blocker, silent wrong-result path, or trust-boundary bypass. Verbatim summary: "the code visibly implements durable reconciliation blocking for ambiguous bank handoffs, explicit cancellation-pending states, signed approval/hash checks, audit/inbox inspectability."

Fix: Cert badge minted. Public review URL is live.

Principle scorecard

Every flagged principle, Iter1 vs Iter10, at a glance

Five principles across the trajectory fired as production_blocker / high_risk on at least one validate or cert pass; the Iter10 run closes all of them. The narrative below walks through each; this table is the scannable summary.

| Principle | Cluster | Trajectory peak | Iter10 production-ready |
| --- | --- | --- | --- |
| P1, Design for Delegation | Authority | HIGH_RISK | ALIGNED |
| P5, Replace Implied Magic with Clear Mental Models | Trust | HIGH_RISK | ALIGNED |
| P6, Expose Meaningful Operational State | Visibility | HIGH_RISK | ALIGNED |
| P8, Make Hand-offs and Blockers Explicit | Orchestration | HIGH_RISK | ALIGNED |
| P10, Optimise for Steering | Orchestration | HIGH_RISK | ALIGNED |

Refactor scope

Iter1 ungoverned vs Iter10 cert-confirmed

Numbers are verbatim from the package source. The original 156-line script had no audit, no run_id, no approval, no idempotency, and no recovery. Iter 10 is ~2,384 lines because the cert reviewer surfaced a different load-bearing failure on each cert call; every line earned its place against a specific finding.

| Aspect | Iter1 ungoverned | Iter10 cert-confirmed |
| --- | --- | --- |
| Lines of code | 156 | ~2,384 (durable preflight + 11-state lifecycle + projection-aware cancel) |
| Score | 0 / F / draft | 100 / A / production_ready |
| Aligned principles | 0 / 10 | 10 / 10 |
| Validate iterations | | 10 prod-MCP runs |
| Cert calls (second-pass) | | 4 (Iter 4, 5, 6 downgrade · Iter 10 confirmed) |
| Run lifecycle states | 1 (PENDING) | 11 (incl. BLOCKED_ON_RECONCILIATION, CANCELLATION_PENDING_RECONCILIATION) |
| Bank handoff outcomes modeled | 1 (success) | 5 (FAILED · SUBMITTED · SUBMITTED_UNKNOWN · SUCCEEDED · SIMULATED_SUCCEEDED) |
| Audit ledger | None | Hash-chained JSONL with verify_chain() tamper detection |
| Approval gate | None | HMAC-SHA256 envelope bound to run_id, task_id, policy_hash, invoice_snapshot_hash, approver role |
| Cert outcome | | confirmed_production_ready · 0 findings |
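
For readers unfamiliar with the audit-ledger row, a hash-chained JSONL log can be sketched in a few lines. This is an illustrative assumption about how such a ledger might work, not the package's code: the record layout, the GENESIS constant, and append_event are hypothetical, and an in-memory list of JSON lines stands in for the file. Only the verify_chain() name appears in the case study.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed sentinel hash for the first record


def _digest(prev_hash, event):
    """Hash the previous record's hash together with a canonical event body."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(lines, event):
    """Append one JSON line whose hash covers the previous line's hash."""
    prev_hash = json.loads(lines[-1])["hash"] if lines else GENESIS
    record = {
        "prev_hash": prev_hash,
        "event": event,
        "hash": _digest(prev_hash, event),
    }
    lines.append(json.dumps(record, sort_keys=True))


def verify_chain(lines):
    """Return True only if every record links to and matches its predecessor."""
    prev_hash = GENESIS
    for line in lines:
        record = json.loads(line)
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["event"]):
            return False
        prev_hash = record["hash"]
    return True
```

Editing any earlier event changes its digest, which no longer matches the stored hash, so verification fails from that record onward.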

Before / After

Before

The problem: every load-bearing guarantee was missing

The original script drafted invoices, called transfer_funds(), and returned. There was no run_id, so a process crash mid-run left no recoverable state. The except block swallowed every exception silently. There was no approval gate, so the script auto-paid every classified-eligible invoice. There was no idempotency key, so a retry double-paid. The audit story was a single print() inside the loop.

# Iter1 baseline (excerpt, full file is 156 lines)
def execute_invoice_run():
    invoices = fetch_due_invoices()
    for invoice in invoices:
        if eligible_for_auto_pay(invoice):
            try:
                response = transfer_funds(
                    iban=invoice.iban,
                    amount=invoice.amount_pence,
                    currency=invoice.currency,
                )
                print(f"Paid {invoice.invoice_id}: {response}")
            except Exception:
                pass  # silent

# No run_id. No approval gate. No idempotency.
# No audit. No recovery. Crash mid-loop = orphaned bank requests.
# Retry = double-pay. Bank API stub = audit ledger lies.

After

After: durable preflight, projection-aware cancel, signed approval, hash-chained audit

Iter 10 refuses to make any bank call while a SUBMITTED_UNKNOWN task exists; this durable preflight is the load-bearing P8 fix. Cancelling during exposure transitions the run to CANCELLATION_PENDING_RECONCILIATION (not terminal CANCELLED). Every approval is HMAC-bound to run_id + task_id + policy_hash + invoice_snapshot_hash + approver role, so forging one requires the secret. The audit ledger is hash-chained JSONL; verify_chain() detects any post-hoc edit. Mock submissions get TaskStatus.SIMULATED_SUCCEEDED plus a simulated:bool field, so they are never confused with real bank confirmation in any audit surface.

# Iter10 cert-confirmed: pre-execute durable preflight
def execute_approved_payments(self, *, run_id):
    run = self._require_run(run_id)
    # Iter 10 P8 cert-fix: refuse to start any submission pass
    # while ANY task is SUBMITTED_UNKNOWN. Blocker is durable
    # across calls. The operator MUST reconcile existing bank
    # exposure before any new external bank call.
    unresolved = [
        t for t in self.tasks[run_id].values()
        if t.status == TaskStatus.SUBMITTED_UNKNOWN
    ]
    if unresolved:
        self.ledger.append(
            run_id=run_id, task_id=None,
            event_type="run.execution_blocked_on_unresolved_exposure",
            payload={
                "unresolved_task_ids": [t.task_id for t in unresolved],
                "required_action": "reconcile_submitted_task or confirm_mock_simulation",
            },
        )
        run.status = RunStatus.BLOCKED_ON_RECONCILIATION
        return run  # NO bank call. NO new exposure.
    # ...
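
The HMAC-bound approval described in this section can be sketched as follows. The five bound fields come from the case study; sign_approval, verify_approval, the approver_role key spelling, and the canonical JSON encoding are assumptions made for illustration.

```python
import hashlib
import hmac
import json

# Fields the signature is bound to, per the case study.
FIELDS = ("run_id", "task_id", "policy_hash", "invoice_snapshot_hash", "approver_role")


def _canonical(envelope):
    """Serialise only the bound fields in a stable order."""
    bound = {key: envelope[key] for key in FIELDS}
    return json.dumps(bound, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_approval(secret, envelope):
    """Return a hex HMAC-SHA256 signature over the bound fields."""
    return hmac.new(secret, _canonical(envelope), hashlib.sha256).hexdigest()


def verify_approval(secret, envelope, signature):
    """Check the signature in constant time; any changed field fails."""
    expected = sign_approval(secret, envelope)
    return hmac.compare_digest(expected, signature)
```

An approval signed for one task cannot be replayed against another task, run, policy version, or invoice snapshot, because each of those values feeds the signature.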

Validator output

What the cert reviewer found

The Blueprint MCP ran architect.validate ten times and architect.certify four times against this code base. Five principle-level production blockers were identified across the trajectory, each one a path for an irreversible bank transfer to fire under conditions the operator never authorised.

P1, Design for Delegation

execute_approved_payments() incremented the run-budget counter (run_total_submitted) only when a task hit SUCCEEDED. After SUBMITTED_UNKNOWN became reachable (iter 8), a task in that state, where the bank may already have accepted the transfer, was invisible to the cap check. The loop continued submitting more invoices, breaching policy.max_run_total_pence by ~20% in the worst case (3 × £40k = £120k against a £100k cap).

P5, Replace Implied Magic with Clear Mental Models

In Iter 6, the stubbed transfer_funds() returned a synthetic success dict without ever calling a real bank. The audit ledger then recorded task.succeeded with a mock transfer_id, indistinguishable from a real bank-confirmed success. The mental model the operator saw, "the agent paid the vendor", was decoupled from what actually happened, "the agent fabricated success metadata".

P6, Expose Meaningful Operational State

After an ambiguous bank handoff the run held in RunStatus.AWAITING_APPROVAL, a state whose label suggests the operator's job is to approve the next thing. The actual job was to RECONCILE the existing exposure. The state name lied to the operator about what was needed; the same misalignment lived on the cancel path, where a run could become terminal CANCELLED while bank exposure remained unresolved.

P8, Make Hand-offs and Blockers Explicit

Iter 9 added an in-pass halt on SUBMITTED_UNKNOWN, but the blocker was not durable across calls. A second execute_approved_payments() would skip the unknown task (it was no longer APPROVED), recompute the budget projection, and submit the next approved invoice if cap room remained. Bank exposure could compound across two outstanding requests with the first one unreconciled.

P10, Optimise for Steering

cancel_run() flipped run.status to terminal CANCELLED unconditionally, even when SUBMITTED / SUBMITTED_UNKNOWN tasks remained. _reconcile_run_status() returned early on terminal runs, so no later reconciliation could fix the top-level state. An operator could cancel a run that still had outstanding bank requests and lose the audit trail's ability to distinguish "reconciled and cancelled" from "cancelled while money was still moving".

How each blocker was resolved

What the iterations fixed

Each iteration closed at least one production blocker. Iter 10 is the first run where every flagged principle reaches aligned and the second-pass cert reviewer finds zero missed blockers.

P1, Design for Delegation

Project at-risk exposure from the FULL task set, including SUBMITTED_UNKNOWN and SIMULATED_SUCCEEDED, before every submit. The cap check now runs against worst-case money-at-risk, not just confirmed success.
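
A worst-case exposure guard along these lines might look like the sketch below. The status names and the max_run_total_pence cap come from the case study; the AT_RISK set (including SUBMITTED and SUCCEEDED alongside the two statuses named above), the dict-based task shape, and the function names are assumptions.

```python
# Statuses whose amounts may already have left the account (assumed set).
AT_RISK = {"SUBMITTED", "SUBMITTED_UNKNOWN", "SUCCEEDED", "SIMULATED_SUCCEEDED"}


def projected_exposure_pence(tasks):
    """Sum money-at-risk across the full task set, not just confirmed successes."""
    return sum(t["amount_pence"] for t in tasks if t["status"] in AT_RISK)


def may_submit(tasks, next_amount_pence, max_run_total_pence):
    """Allow a submission only if worst-case exposure stays within the cap."""
    projected = projected_exposure_pence(tasks) + next_amount_pence
    return projected <= max_run_total_pence


# The scenario from the finding: a £100k cap and three £40k invoices.
cap = 10_000_000  # £100k in pence
tasks = [
    {"status": "SUCCEEDED", "amount_pence": 4_000_000},
    {"status": "SUBMITTED_UNKNOWN", "amount_pence": 4_000_000},
]
third_allowed = may_submit(tasks, 4_000_000, cap)  # False: £120k would exceed £100k
```

Counting only SUCCEEDED tasks would see £40k of exposure and let the third invoice through; counting the ambiguous handoff as well sees £80k and refuses it.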

P5, Replace Implied Magic with Clear Mental Models

Explicit BANK_MODE env (live|mock), fail-closed on ambiguity. Mock responses tagged mock=True. New TaskStatus.SIMULATED_SUCCEEDED + simulated:bool field on tasks, surfaced separately in inspect_run() so a reviewer can never confuse mock-confirmed with bank-confirmed.

P6, Expose Meaningful Operational State

Two new explicit run states. BLOCKED_ON_RECONCILIATION replaces AWAITING_APPROVAL when the dominant action is reconcile-not-approve. CANCELLATION_PENDING_RECONCILIATION replaces terminal CANCELLED while SUBMITTED / SUBMITTED_UNKNOWN exposure remains; only after exposure clears via reconcile / confirm-mock does the run transition to terminal CANCELLED.

P8, Make Hand-offs and Blockers Explicit

Pre-execute durable preflight at the top of execute_approved_payments(). Scan the task projection; if any SUBMITTED_UNKNOWN exists, append run.execution_blocked_on_unresolved_exposure, post a critical inbox alert, transition the run to BLOCKED_ON_RECONCILIATION, and return without making any bank call. Only reconcile_submitted_task() / confirm_mock_simulation() can clear the blocker.

P10, Optimise for Steering

cancel_run() is now projection-aware. With in-flight bank exposure it transitions to CANCELLATION_PENDING_RECONCILIATION (not terminal CANCELLED). _reconcile_run_status() now excludes pending states from its early-return list, so once exposure clears via reconcile / confirm-mock, the run promotes to terminal CANCELLED with full audit-chain continuity.
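
The cancel path can be sketched as two small functions over the task projection. The two run-state names and the SUBMITTED / SUBMITTED_UNKNOWN statuses are from the case study; cancel_status, reconcile_cancel, and the IN_FLIGHT set are hypothetical names, and real code would also write ledger events.

```python
from enum import Enum


class RunStatus(Enum):
    CANCELLED = "cancelled"
    CANCELLATION_PENDING_RECONCILIATION = "cancellation_pending_reconciliation"


# Task statuses that represent unresolved bank exposure.
IN_FLIGHT = {"SUBMITTED", "SUBMITTED_UNKNOWN"}


def cancel_status(task_statuses):
    """Pick the cancel target state from the whole task projection."""
    if any(status in IN_FLIGHT for status in task_statuses):
        return RunStatus.CANCELLATION_PENDING_RECONCILIATION
    return RunStatus.CANCELLED


def reconcile_cancel(current, task_statuses):
    """Promote a pending cancellation once exposure has cleared."""
    if current is RunStatus.CANCELLATION_PENDING_RECONCILIATION:
        # Pending states are re-evaluated; terminal CANCELLED is left alone.
        return cancel_status(task_statuses)
    return current
```

Keeping the pending state out of the early-return path is what lets a later reconciliation move the run to terminal CANCELLED.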

Re-validation result

Iter 10: architect.certify confirmed production_ready

Iter 10's implementation was re-validated to 100/A/production_ready, then certified in the same prod-MCP session. The cert outcome was confirmed_production_ready, zero certification_findings. Verbatim summary from the cert reviewer: "the code visibly implements durable reconciliation blocking for ambiguous bank handoffs, explicit cancellation-pending states, signed approval/hash checks, audit/inbox inspectability, and no specific missed production-blocking crash, silent wrong-result path, or trust-boundary bypass is evidenced."

Iter1 (before)

High Risk · 0/F

0 of 10 principles aligned · no audit · no approval gate

Iter10 (after)

Aligned · Cert confirmed

10 of 10 principles aligned · 0 missed blockers

Time to fix

Ten iterations

Four cert calls · three cert downgrades caught · one cert confirmation

View the live readiness review →

Calculated ROI

Same metrics, same calculator powering every case study

Derived deterministically from this case study's profile (10 iterations, irreversible-financial blast radius, autonomous workflow, under compliance) via /lib/case-study-roi.ts. Numbers directly comparable to the other case studies.

Senior-architect time replaced

~255 hours @ $150/hour ≈ ~$38K per agent

Production ROI per agent / year

$120K – $280K (incident prevention + audit prep + rework)

Time to identify the governance gaps

2-4 weeks of senior-architect review WITHOUT Blueprint, ~50 min / 10 validator passes WITH Blueprint

Incidents prevented (range)

4-12 per year of unintended irreversible financial actions (each ~4-40 hours of incident-response / rollback)

Compliance audit prep

~80-120 hours / year replaced with one audit query

Related, Pro / Teams

Run this as a Blueprint Readiness Score

The Architect Agent is the same review pattern shown in this case study, applied to your code. Call architect.validate to get a Blueprint Readiness Score (0–100, A–F) per repository, and a regression diff between runs so the next review focuses on what changed.

Sample score card

acme/customer-agent · B · 82 / 100 · Production-ready · ▲ 7

Run your own validation

Paste your agent code or describe your workflow. The validator returns principle-by-principle findings, a readiness score, and a shareable review URL in seconds. Reach 80+/A and architect.certify mints a public badge that matches the one above, after a second-pass adversarial review.