Production-ready case study

From silent submission to operator-governed in four iterations

An autonomous browser/form-fill agent's submission scope (click, submit, keypress) could fire payment forms, signups, and irreversible posts without human review. Four iterations of architect.validate (the Blueprint's MCP-served pre-prod readiness check) turned it into a governed system where every submission requires an operator decision, with public cert badges to prove it.

Key Facts

Validator iterations: 4 prod-MCP runs
Score trajectory: 68/C → 100/A
Production blockers closed: 4 of 4 P0
Cert outcome: confirmed_production_ready
Doctrine compliance: 10 / 10 principles aligned

Live validation

Two live readiness badges, before and after

Both runs are real validator outputs, not demos. The before badge links to the Iter1 baseline (68/C/emerging, four P0 blockers); the after badge links to the Iter4 cert-confirmed run (100/A/production_ready). Each prior iteration is linked in the trajectory below.

After, Iter4 production-ready

Validator trajectory

Four iterations, every run ID public

Each iteration was a real prod-MCP architect.validate call. Iter2 closed two of the four P0 blockers, but the score plateaued at 74/C in Iter3 because the architect's prompt deduplication judged the change-summary payload too similar to the prior baseline. Iter4 re-fired with surgical lease-fence and bidirectional-audit changes, and the score jumped 26 points, from 74/C to 100/A.

| Iter | Change | Score | Verdict | Run ID |
| --- | --- | --- | --- | --- |
| Iter1 | Initial submission, ungoverned | 68 / C | HIGH_RISK | e78f225f-32ce-4426-a90b-4143212302be |
| Iter2 | Redaction + lease + dedup + immutability | 74 / C | HIGH_RISK | 742680ee-26e9-4e7b-9bf9-0fbd4e6ead8c |
| Iter3 | Digest-only runs + queue-timeout reaper | 74 / C | HIGH_RISK | 37cc23be-e74b-40d4-8703-a9366ca98910 |
| Iter4 | Lease fence + bidirectional audit | 100 / A | ALIGNED | ffcc637e-1280-4f34-8fed-57c7359d7466 |

Principle scorecard

Every flagged principle, Iter1 vs Iter4, at a glance

Four principles fired as production_blocker / high_risk on the Iter1 baseline; the Iter4 run closes all of them. The narrative below walks through each one; this table is the scannable summary.

| Principle | Cluster | Iter1 baseline | Iter4 production-ready |
| --- | --- | --- | --- |
| P0, Establish Trust Through Inspectability | Trust | HIGH_RISK | ALIGNED |
| P2, Background Work Perceptible | Visibility | HIGH_RISK | ALIGNED |
| P3, Align Feedback with Attention Level | Visibility | HIGH_RISK | ALIGNED |
| P10, Optimise for Steering | Orchestration | HIGH_RISK | ALIGNED |

Refactor scope

Iter1 ungoverned vs Iter4 production-ready

Numbers are verbatim from the package source. The agent was already feature-complete at Iter1 (typed governance contract, approval gates, hash-chained audit ledger); the four iterations closed deeper seams: a redaction-aware audit boundary, a durable lease with a watchdog, frozen policies, and end-to-end evidence verification.

| Aspect | Iter1 ungoverned | Iter4 production-ready |
| --- | --- | --- |
| Lines of code | ~3,362 | ~3,417 (digest-only + lease fence) |
| Score | 68 / C / emerging | 100 / A / production_ready |
| Production blockers | 4 P0 flagged | 0 |
| Aligned principles | 6 / 10 | 10 / 10 |
| Audit ledger payload | Raw policy + raw target + raw payload | Digest-only, evidence segregated |
| Process-death recovery | None, run stuck in IN_PROGRESS | Watchdog reaps stale-leased runs to TIMED_OUT |
| Action executor authority | Could fire under reaped lease | Lease fence checks ownership before every executor call |
| Policy mutability | Implicit, no enforcement | Frozen at genesis (Pydantic + runtime digest re-derive) |
| Cert outcome | | confirmed_production_ready |

Before / After

Before

The problem: a stolen lease could ship a click

The agent claimed an in-memory asyncio task at start. If the process died, the run sat in IN_PROGRESS with no terminal path. The audit ledger logged raw policy, target_text, and form payloads, so credentials and PII ended up in the same hash chain that operators audit against. _wait_for_resume swallowed lease loss silently, letting the SDK advance past the pause boundary and fire an irreversible click under a lease the watchdog had already reaped.

# Iter1 audit ledger row (excerpt)
{
  "kind": "tool_call_completed",
  "data": {
    "action_name": "click",
    "target_text": "Submit payment of £21,500 to vendor X",
    "action_payload": {"x": 412, "y": 580, "card_number": "4111-1111-1111-1111"},
    "result": "executed:click:Submit payment of £21,500..."
  }
}

# In-process asyncio task with no durable handle
# Process dies → run stuck in IN_PROGRESS forever
# verify_audit() returns True even when evidence has been swapped

After

After: every submission gated, every digest verified

The Iter4 build commits only digests and structured size markers to the hash chain. Raw payloads land in a segregated evidence table with separate access trust. The runs row is digest-only, which leaves no plaintext leak path. A durable lease plus a watchdog reaper covers both the 'process died mid-loop' and the 'external worker never claimed it' branches. Every action executor invocation passes through a heartbeat fence: a stolen lease cannot ship a click.

# Iter4 audit ledger row (same event, redacted)
{
  "kind": "tool_call_completed",
  "data": {
    "action_name": "click",
    "target_text": {"_redacted": true, "digest": "8c4f...", "value_kind": "str", "size": 38},
    "action_payload": {"_redacted": true, "digest": "a91b...", "value_kind": "dict", "size": 84},
    "result": {"_redacted": true, "digest": "f2d3...", "value_kind": "str", "size": 47}
  }
}

# verify_audit recomputes sha256(evidence.value_json):
# three-way bind (recomputed == value_digest column == chain marker)
# plus bidirectional check (every marker has evidence ⇔ every evidence has marker)
# Tampering either side breaks certification.
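
The marker shape in the Iter4 row above can be sketched in a few lines. This is a minimal illustration, not the package's code: the helper names `canonical_bytes` and `redact`, the canonical JSON encoding, and the choice to report `size` as the byte length of that encoding are all assumptions made for the example.

```python
import hashlib
import json
from typing import Any


def canonical_bytes(value: Any) -> bytes:
    """Serialize deterministically so equal values always share one digest.

    Hypothetical helper: the real encoding used by the agent is not shown
    in the case study.
    """
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def redact(value: Any) -> dict:
    """Build the marker that goes into the hash chain instead of the raw value."""
    raw = canonical_bytes(value)
    return {
        "_redacted": True,
        "digest": hashlib.sha256(raw).hexdigest(),
        "value_kind": type(value).__name__,
        # Assumption: size is the byte length of the canonical encoding.
        "size": len(raw),
    }


# The raw payload would go to the segregated evidence table;
# only the marker is committed to the ledger row.
marker = redact({"x": 412, "y": 580, "card_number": "4111-1111-1111-1111"})
```

The marker carries enough to prove later that a stored value is the one that was logged, while the card number itself never enters the chain.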

Validator output

What the validator found

The Blueprint MCP ran architect.validate against the Iter1 baseline and identified four P0 production blockers, each one a path for an irreversible action to fire without operator authority.

P0, Establish Trust Through Inspectability

Audit ledger logged raw policy_json + raw action_payload + raw target_text + raw error reprs that may carry credentials or PII in form-fill agents. The runs row stored raw policy_json. verify_audit checked digest markers but never recomputed sha256 from stored evidence, so tampered evidence could pass verification.

P2, Background Work Perceptible

Execution lifecycle had no durability: asyncio.create_task held the live task only on the in-memory handle, and timeout was enforced only by asyncio.wait_for. If the process died, the run was stuck in IN_PROGRESS forever. INITIALISED runs without a lease could not be reaped because the predicate required lease_expires_at IS NOT NULL.

P3, Align Feedback with Attention Level

CLI _stream_attention_required deduplicated by RunState set, so a second AWAITING_APPROVAL transition (a multi-step form where every submit needs operator approval) was silently swallowed. The operator would see the first approval announcement and never learn about subsequent ones, so a multi-page form would look as if the run had hung.

P10, Optimise for Steering

_wait_for_resume swallowed LeaseLost (silent return), letting Runner.run see the pause boundary as completed and advance to the next step under a lease the watchdog had reaped. Action executor invocation had no lease/state fence, so a stolen lease could let an irreversible click ship after the operator's intent was overridden.

How each P0 was resolved

What the iterations fixed

Each iteration closed at least one production blocker. Iter4 is the first run to cross the 80/A/production_ready threshold; every P0 finding above is now aligned, at 100/100.

P0, Establish Trust Through Inspectability

Redaction-aware boundary: every sensitive field becomes {_redacted, digest, value_kind, size} in the hash chain; raw values land in a segregated evidence sidecar with separate access trust. The runs row is digest-only. verify_audit recomputes sha256(evidence.value_json) from raw bytes and requires three-way agreement between the recomputed digest, the value_digest column, and the chain marker, plus a bidirectional check that every redacted marker has matching evidence and every evidence row has a marker.
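
The three-way bind and the bidirectional check can be sketched as one function. The sketch is an illustration under assumptions: `EvidenceRow`, `verify_evidence`, and the `(seq, field)` join key are hypothetical names, and the real verify_audit also walks the hash chain, which is left out here.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical join key between a chain marker and its evidence row.
Key = tuple[int, str]  # (event seq, field name)


@dataclass(frozen=True)
class EvidenceRow:
    value_json: str    # raw serialized value in the evidence table
    value_digest: str  # digest column stored next to it


def verify_evidence(markers: dict[Key, str], evidence: dict[Key, EvidenceRow]) -> list[str]:
    """Return a list of problems; an empty list means the evidence verifies."""
    problems = []
    for key, marker_digest in sorted(markers.items()):
        row = evidence.get(key)
        if row is None:
            problems.append(f"{key}: marker has no evidence")
            continue
        # Recompute from the raw bytes rather than trusting the stored column.
        recomputed = hashlib.sha256(row.value_json.encode("utf-8")).hexdigest()
        if not (recomputed == row.value_digest == marker_digest):
            problems.append(f"{key}: digest mismatch")
    # Reverse direction: evidence that no chain marker vouches for.
    for key in sorted(evidence.keys() - markers.keys()):
        problems.append(f"{key}: evidence has no marker")
    return problems


raw = '"Submit payment"'
digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
key = (7, "target_text")
markers = {key: digest}
evidence = {key: EvidenceRow(value_json=raw, value_digest=digest)}
problems = verify_evidence(markers, evidence)
```

Swapping `value_json` alone fails the recompute; swapping `value_json` and `value_digest` together still fails against the chain marker, which is what closes the Iter1 gap where swapped evidence passed.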

P2, Background Work Perceptible

Durable lease columns on the runs table (worker_id, heartbeat_at, lease_expires_at). Hooks heartbeat at every checkpoint. Watchdog reaper transitions stale-leased runs to TIMED_OUT. create_run sets initial lease_expires_at = now + queue_timeout so unclaimed INITIALISED runs are reapable via the same predicate, covering BOTH 'process died mid-loop' and 'external worker never claimed it' branches.
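
A single reaper predicate covering both branches can be sketched with an in-memory SQLite table. The column names and run states come from the description above; the function names, the SQL, the numeric clock, and the decision to fold claim and heartbeat into one statement are assumptions made for the example.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE runs (
               run_id TEXT PRIMARY KEY,
               state TEXT NOT NULL,
               worker_id TEXT,
               heartbeat_at REAL,
               lease_expires_at REAL
           )"""
    )
    return db


def create_run(db: sqlite3.Connection, run_id: str, now: float, queue_timeout: float) -> None:
    # The initial lease lets the reaper catch runs that no worker ever claims.
    db.execute(
        "INSERT INTO runs (run_id, state, lease_expires_at) VALUES (?, 'INITIALISED', ?)",
        (run_id, now + queue_timeout),
    )


def claim_or_heartbeat(db: sqlite3.Connection, run_id: str, worker_id: str,
                       now: float, lease_ttl: float) -> bool:
    """Extend the lease; returns False if this worker no longer owns the run."""
    cur = db.execute(
        """UPDATE runs
              SET state = 'IN_PROGRESS', worker_id = ?, heartbeat_at = ?, lease_expires_at = ?
            WHERE run_id = ?
              AND state IN ('INITIALISED', 'IN_PROGRESS')
              AND (worker_id IS NULL OR worker_id = ?)""",
        (worker_id, now, now + lease_ttl, run_id, worker_id),
    )
    return cur.rowcount == 1


def reap_stale(db: sqlite3.Connection, now: float) -> int:
    """Move every run whose lease has expired to TIMED_OUT; returns the count."""
    cur = db.execute(
        """UPDATE runs SET state = 'TIMED_OUT'
            WHERE state IN ('INITIALISED', 'IN_PROGRESS')
              AND lease_expires_at IS NOT NULL
              AND lease_expires_at < ?""",
        (now,),
    )
    return cur.rowcount


db = open_db()
create_run(db, "never-claimed", now=0.0, queue_timeout=30.0)
create_run(db, "died-mid-loop", now=0.0, queue_timeout=30.0)
# worker-1 claims the run, then stops heartbeating (simulated process death).
claim_or_heartbeat(db, "died-mid-loop", "worker-1", now=5.0, lease_ttl=10.0)
first_pass = reap_stale(db, now=20.0)   # catches the dead worker's run
second_pass = reap_stale(db, now=31.0)  # catches the unclaimed run
```

Because create_run always writes a lease expiry, the IS NOT NULL clause no longer excludes unclaimed runs, and a worker that comes back after being reaped gets `False` from its next heartbeat.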

P3, Align Feedback with Attention Level

Dedup is keyed by the (kind, seq) tuple: every state_transition event with seq=N is announced exactly once, regardless of which RunState the run cycled through. Multi-step approval flows now surface every gate.
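
A short sketch of the (kind, seq) keying, assuming events arrive as dictionaries with `kind`, `seq`, and `state` fields. The function name `attention_events` and the event layout are hypothetical; only the dedup key and the state names come from the description above.

```python
from typing import Iterable, Iterator


def attention_events(events: Iterable[dict]) -> Iterator[dict]:
    """Yield each approval gate once, even when the same state recurs."""
    seen: set[tuple[str, int]] = set()
    for event in events:
        key = (event["kind"], event["seq"])
        if key in seen:
            continue  # the same event delivered again, e.g. by a re-poll
        seen.add(key)
        if event["kind"] == "state_transition" and event["state"] == "AWAITING_APPROVAL":
            yield event


stream = [
    {"kind": "state_transition", "seq": 1, "state": "AWAITING_APPROVAL"},
    {"kind": "state_transition", "seq": 2, "state": "IN_PROGRESS"},
    {"kind": "state_transition", "seq": 3, "state": "AWAITING_APPROVAL"},
    {"kind": "state_transition", "seq": 3, "state": "AWAITING_APPROVAL"},  # duplicate delivery
]
announced = [event["seq"] for event in attention_events(stream)]
```

Keying on the state alone would have dropped the seq=3 gate because AWAITING_APPROVAL had already been seen; keying on (kind, seq) drops only the true duplicate.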

P10, Optimise for Steering

Policies frozen at genesis (Pydantic ConfigDict(frozen=True) + runtime digest re-derive at loop start). _wait_for_resume now propagates LeaseLost as AbortRequested so the SDK loop unwinds cleanly. GovernedActionDispatcher.dispatch performs a lease/state heartbeat fence immediately before action_executor invocation; LeaseLost at the fence emits a typed lease_fence_failed event and re-raises as ScopeViolation.
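
The fence in GovernedActionDispatcher.dispatch can be sketched as a check that runs immediately before the executor call. The class and exception names and the lease_fence_failed event come from the description above; the constructor signature, the synchronous style, and the `check_lease` and `emit` callables are assumptions made to keep the example self-contained.

```python
from typing import Any, Callable


class LeaseLost(Exception):
    """This worker no longer owns the run's lease."""


class ScopeViolation(Exception):
    """An action was refused because the caller lacks authority."""


class GovernedActionDispatcher:
    def __init__(
        self,
        check_lease: Callable[[], None],           # hypothetical: raises LeaseLost
        action_executor: Callable[[str, dict], Any],
        emit: Callable[[str, dict], None],         # hypothetical event sink
    ) -> None:
        self._check_lease = check_lease
        self._action_executor = action_executor
        self._emit = emit

    def dispatch(self, action_name: str, payload: dict) -> Any:
        try:
            # Fence: confirm lease ownership right before the irreversible call.
            self._check_lease()
        except LeaseLost as exc:
            self._emit("lease_fence_failed", {"action_name": action_name})
            raise ScopeViolation(f"lease lost before {action_name}") from exc
        return self._action_executor(action_name, payload)


lease = {"owned": True}
executed: list[str] = []
events: list[tuple[str, dict]] = []


def check_lease() -> None:
    if not lease["owned"]:
        raise LeaseLost("lease reaped by watchdog")


def action_executor(name: str, payload: dict) -> str:
    executed.append(name)
    return f"executed:{name}"


def emit(kind: str, data: dict) -> None:
    events.append((kind, data))


dispatcher = GovernedActionDispatcher(check_lease, action_executor, emit)
result = dispatcher.dispatch("click", {"x": 412, "y": 580})
```

Setting `lease["owned"] = False` and dispatching again raises ScopeViolation, records one lease_fence_failed event, and leaves `executed` unchanged, so the executor is never reached under a lost lease.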

Re-validation result

After Iter4: architect.certify confirmed production_ready

The Iter4 implementation was re-validated and then certified in the same prod-MCP session. Cert outcome: confirmed_production_ready. The badge is live and the readiness review is publicly inspectable.

Iter1 (before)

High Risk

4 P0 blockers · 6 of 10 principles aligned

Iter4 (after)

Aligned · Cert confirmed

0 P0 blockers · 10 of 10 principles aligned

Time to fix

Four iterations

From flagged P0 blockers to confirmed_production_ready

View the live readiness review →

Calculated ROI

Same metrics, same calculator powering every case study

Derived deterministically from this case study's profile (4 iterations, irreversible-financial blast radius, autonomous workflow, under compliance) via /lib/case-study-roi.ts. Numbers directly comparable to the other case studies.

Senior-architect time replaced

~135 hours @ $150/hour ≈ ~$20K per agent

Production ROI per agent / year

$120K – $280K (incident prevention + audit prep + rework)

Time to identify the governance gaps

2-4 weeks of senior-architect review WITHOUT Blueprint, ~20 min / 4 validator passes WITH Blueprint

Incidents prevented (range)

4-12 per year of unintended irreversible financial actions (each ~4-40 hours of incident-response / rollback)

Compliance audit prep

~80-120 hours / year replaced with one audit query

Related, Pro / Teams

Run this as a Blueprint Readiness Score

The Architect Agent is the same review pattern shown in this case study, applied to your code. Call architect.validate to get a Blueprint Readiness Score (0–100, A–F) per repository, and a regression diff between runs so the next review focuses on what changed.

Sample score card

acme/customer-agent: B, 82 / 100, Production-ready, ▲ 7

Run your own validation

Paste your agent code or describe your workflow. The validator returns principle-by-principle findings, a readiness score, and a shareable review URL in seconds. Reach 80+/A and cert mints a public badge that matches the one above.