Governance test #2, ALIGNED

From auto-merge with hardcoded scores to a fully governed PR triage agent

An agentic code-review workflow that calls an LLM, auto-applies fixes, auto-comments, and auto-merges any PR scoring 7/10 or above. No human approval. No audit trail. No rollback. Six validator passes took it from HIGH_RISK to ALIGNED, with both public badges below.

Key Facts

Validator status
HIGH_RISK → ALIGNED
Validator passes
6 (v1 to v6)
Real bugs caught
9 production-grade defects
Senior-architect time replaced
~149 h · ~$22K per agent
Production ROI per agent / year
$80K – $200K

Live validation

Two live readiness badges, before and after

Both badges below are live, anchored to real validator runs. Click through to see the full readiness review for each: the v1 baseline (HIGH_RISK) and the v6 final (ALIGNED).

Blueprint Readiness Score card, v6 after run

v1, Ungoverned baseline

v6, Governance-ready

The scenario

An agentic PR triage that mirrors the 2026 SDLC trend, minus the governance

For each incoming PR the agent (1) sends the diff to an LLM for review, (2) auto-applies the AI's suggested fixes (rewriting code on the PR branch), (3) auto-posts the AI review as a comment, and (4) auto-merges the PR if the AI score is at least 7/10. No human approval, no audit trail, no rollback. The blast radius is worse than in Test #1: that document processor only sent emails, which are recoverable; this agent rewrites source code and merges to main, both irreversible, production-touching actions.
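The four numbered steps compress to a few lines. This is a hypothetical sketch of the ungoverned control flow; the function names and the review dict shape are illustrative assumptions, not the real script:

```python
# Hypothetical sketch of the ungoverned v1 loop: no approval, no audit, no rollback.
def triage(pr, llm_review, apply_fixes, post_comment, merge):
    review = llm_review(pr["diff"])       # (1) send the diff to the LLM
    apply_fixes(pr, review["fixes"])      # (2) rewrite code on the PR branch
    post_comment(pr, review["summary"])   # (3) auto-post the AI review
    if review["score"] >= 7:              # (4) auto-merge at 7/10 or above,
        merge(pr)                         #     with no human in the loop
```

Every governance fix described below inserts something between one of these four calls: a scope check before (2), a parse gate before (4), an approval gate before any mutation.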

Validator trajectory

Six passes, every run ID public

Each pass has its own run ID, and each verdict is signed by the live validator. This is what "governance-ready" looks like under load: not a single big-bang refactor but a documented sequence of audits and fixes.

Pass | Description | Verdict | Run ID
v1 | Ungoverned baseline | HIGH_RISK | 74e0dc0e-5525-49c4-bbac-51d7f9e8faa9
v2 | First refactor | NEEDS_CHANGES | 380d6b8c-3291-47e6-ba71-6f4f4f43ae4b
v3 | Full file | NEEDS_CHANGES | 3a4091e8-bab9-4216-b9f4-c276fe2855f5
v4 | Audit-grade provenance | NEEDS_CHANGES | 857d82db-3050-4151-abac-c57e5e1cb770
v5 | Lifecycle hardening | NEEDS_CHANGES | be02639b-323d-481a-be71-59f98f5e0daf
v6 | Aligned | ALIGNED | dea3651b-220a-405a-b3ac-4d987b8a16fd

Principle scorecard

Every flagged principle, v1 vs v6, at a glance

Four principles fired as HIGH_RISK on the v1 baseline; three more as NEEDS_CHANGES. The v6 governance-ready version closes all of them. The narrative below walks through each; this table is the scannable summary.

Principle | Cluster | v1 baseline | v6 governance-ready
P1, Design for Delegation | Delegation | HIGH_RISK | ALIGNED
P2, Background Work Perceptible | Visibility | NEEDS_CHANGES | ALIGNED
P5, Replace Magic with Clear Models | Delegation | HIGH_RISK | ALIGNED
P6, Operational State, Not Internal Complexity | Orchestration | NEEDS_CHANGES | ALIGNED
P7, Trust Through Inspectability | Trust | HIGH_RISK | ALIGNED
P8, Make Hand-offs Explicit | Trust | HIGH_RISK | ALIGNED
P10, Optimize for Steering | Orchestration | NEEDS_CHANGES | ALIGNED

Validator output

What the validator found on the v1 baseline

Seven principles fired across four clusters. Each one is a production risk for an agent with full repository write power.

P1, Design for Delegation

Delegation

No way to set scope or constraints: the agent acts with full repo write power across every PR, regardless of risk profile.

P2, Background Work Perceptible

Visibility

Only print() statements; nothing persisted; no operator dashboard. If the run dies, all progress is lost, and there is no way to leave and return.

P5, Replace Magic with Clear Models

Delegation

The auto-merge threshold is hidden. No execution plan is shown before action. score = 8 is hardcoded: the agent calls the LLM but ignores the response.

P6, Operational State, Not Internal Complexity

Orchestration

State is implicit in control flow. Errors crash the run. No terminal-state representation: aborted runs end up logged as completed.

P7, Trust Through Inspectability

Trust

score is hard-coded to 8. raw_review is discarded. No record of who/what/when. The audit is overwritten on every rerun, and hashes use salted Python hash(), which is not reproducible across processes.

P8, Make Hand-offs Explicit

Trust

Zero approval gates: the agent merges without operator consent, and when approval is absent it silently falls back to commenting instead of pausing.

P10, Optimize for Steering

Orchestration

No mid-flight control. Cannot pause, override, or reject. No skip. No rollback. No per-PR control.

How each violation was resolved

What the v6 governance-ready version replaced them with

The same MCP, the same architect.validate, applied iteratively. Each pass produced specific, testable changes, not a vibe-rewrite.

P1, Design for Delegation

Delegation

Explicit DelegationPolicy with allowed_repos / allowed_actions / high_risk_paths / auto_merge_enabled / auto_merge_max_score. Out-of-scope PRs rejected upfront. Explicit abort action with WorkflowAborted exception.
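A minimal sketch of what such a policy object can look like; the field names follow the case study, while the scope check itself is an illustrative assumption:

```python
from dataclasses import dataclass

# Sketch of an explicit delegation policy. Field names follow the case study;
# the pr_in_scope logic is an assumption about how enforcement could work.
@dataclass(frozen=True)
class DelegationPolicy:
    allowed_repos: tuple
    allowed_actions: tuple
    high_risk_paths: tuple = ()
    auto_merge_enabled: bool = False
    auto_merge_max_score: int = 10

    def pr_in_scope(self, repo: str, touched_paths: list) -> bool:
        # Reject out-of-scope PRs upfront: wrong repo, or any high-risk path touched.
        if repo not in self.allowed_repos:
            return False
        return not any(p.startswith(h)
                       for p in touched_paths
                       for h in self.high_risk_paths)
```

A frozen dataclass makes the policy itself tamper-evident: the agent can read its scope but cannot mutate it mid-run.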

P2, Background Work Perceptible

Visibility

Structured event log with UTC timestamps. save_checkpoint() after every meaningful transition (atomic via .tmp + os.replace). Resumable via PR_AGENT_APPROVAL_<PR_ID> env.
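The .tmp + os.replace pattern is small but load-bearing; a sketch, assuming a JSON-serializable state dict:

```python
import json
import os
from datetime import datetime, timezone

# Sketch of the atomic checkpoint write named above. The state shape is an
# assumption; the .tmp + os.replace atomicity pattern is the real point.
def save_checkpoint(state: dict, path: str) -> None:
    record = {**state, "ts": datetime.now(timezone.utc).isoformat()}
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())   # ensure bytes are on disk before the swap
    os.replace(tmp, path)      # atomic rename: readers see old or new, never partial
```

Because the rename is atomic, a crash at any point leaves either the previous checkpoint or the new one, never a half-written file.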

P5, Replace Magic with Clear Models

Delegation

parse_review() validates the score range, requires a non-empty summary, sets parse_ok = False to block bad output. ExecutionPlan printed before run. DRY-RUN labelled actions.
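A sketch of that defensive parsing, assuming the LLM reply has already been decoded to a dict (the exact shapes in the real agent may differ):

```python
from dataclasses import dataclass

# Sketch of defensive review parsing: out-of-range scores and empty
# summaries set parse_ok = False so nothing downstream can act on them.
@dataclass
class ParsedReview:
    score: int
    summary: str
    parse_ok: bool

def parse_review(raw: dict) -> ParsedReview:
    score = raw.get("score")
    summary = (raw.get("summary") or "").strip()
    if not isinstance(score, int) or not (0 <= score <= 10) or not summary:
        return ParsedReview(score=0, summary=summary, parse_ok=False)
    return ParsedReview(score=score, summary=summary, parse_ok=True)
```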

P6, Operational State, Not Internal Complexity

Orchestration

PRStatus enum (12 states) plus RunStatus enum (5). Mutually exclusive run terminals: COMPLETED, ABORTED, AWAITING_APPROVAL, FAILED. Per-PR try/except with FAILED transitions.
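The key property, mutually exclusive terminals, takes only a few lines to enforce. This sketch keeps just the run-level enum (the full 12-state PRStatus is omitted), and the finish() guard is an illustrative assumption:

```python
from enum import Enum

# Run-level state model: exactly one terminal transition per run.
class RunStatus(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    ABORTED = "aborted"
    AWAITING_APPROVAL = "awaiting_approval"
    FAILED = "failed"

TERMINAL = {RunStatus.COMPLETED, RunStatus.ABORTED,
            RunStatus.AWAITING_APPROVAL, RunStatus.FAILED}

def finish(status: RunStatus, new: RunStatus) -> RunStatus:
    # A second terminal transition is the v1 bug (aborted runs logged as
    # completed); making it raise turns the lie into a crash.
    if status in TERMINAL:
        raise RuntimeError(f"run already terminal: {status.value}")
    return new
```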

P7, Trust Through Inspectability

Trust

Per-run JSON audit with full provenance: model, exact prompt messages, raw response, PR metadata, SHA-256 diff hash plus algorithm name. approver from env. Decision persisted before any mutation. Append-only via mode="x".
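A sketch of the append-only audit write; the record fields beyond the hash are assumptions, while the mode="x" and SHA-256 details are the ones named above:

```python
import hashlib
import json

# Sketch of audit-grade provenance: SHA-256 diff hash plus algorithm name,
# one file per run, created with mode="x" so a rerun fails instead of clobbering.
def write_audit(run_id: str, diff: str, record: dict, out_dir: str = ".") -> str:
    record = {
        **record,
        "diff_hash": hashlib.sha256(diff.encode("utf-8")).hexdigest(),
        "diff_hash_algorithm": "sha256",
    }
    path = f"{out_dir}/audit-{run_id}.json"
    with open(path, "x") as f:   # "x" = exclusive create: append-only by construction
        json.dump(record, f, indent=2)
    return path
```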

P8, Make Hand-offs Explicit

Trust

approval_gate() blocks before any mutation. Only offers policy-allowed actions. Decision persisted before mutation. Absence of approval = a true pause via WorkflowPaused / AWAITING_APPROVAL checkpoint.
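A sketch of such a gate, with WorkflowPaused standing in for the real exception type and the env-var convention taken from the case study:

```python
import os

class WorkflowPaused(Exception):
    """Raised when no approval is present: a durable pause, not a silent fallback."""

def approval_gate(pr_id: str, allowed_actions: set) -> str:
    decision = os.environ.get(f"PR_AGENT_APPROVAL_{pr_id}")
    if decision is None:
        # Absence of approval = pause (AWAITING_APPROVAL), never proceed.
        raise WorkflowPaused(f"PR {pr_id} is AWAITING_APPROVAL")
    if decision not in allowed_actions:
        # Only policy-allowed actions are offered; anything else is rejected loudly.
        raise ValueError(f"decision {decision!r} not in policy: {sorted(allowed_actions)}")
    return decision
```

The contrast with the v1 bug is the first branch: unset no longer means "comment anyway", it means stop and checkpoint.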

P10, Optimize for Steering

Orchestration

Per-PR steering via PR_AGENT_APPROVAL_<PR_ID>, which takes precedence over the global variable. Six verbs: comment / suggest_fix / merge / skip / reject / abort. apply_suggested_fix records a rollback_token.
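The precedence rule is the interesting part; a sketch, with the default verb as an assumption:

```python
import os

# The six steering verbs from the case study.
VERBS = {"comment", "suggest_fix", "merge", "skip", "reject", "abort"}

def steering_decision(pr_id: str, default: str = "comment") -> str:
    # Per-PR variable wins over the global one; the default is an assumption.
    decision = (os.environ.get(f"PR_AGENT_APPROVAL_{pr_id}")
                or os.environ.get("PR_AGENT_APPROVAL")
                or default)
    if decision not in VERBS:
        raise ValueError(f"unknown steering verb: {decision!r}")
    return decision
```

This is what "per-PR control" means concretely: an operator can let a run proceed globally while overriding a single risky PR.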

9 real bugs

Defects the validator caught, every one a real production hazard

These are not stylistic comments. Each item below was a concrete defect that would have shipped, and several would have silently lied to operators or auditors.

  1. Fabricated AI judgement: score = 8 hardcoded despite calling the LLM.
  2. Non-deterministic audit hash: Python hash() is salted and breaks across processes. Replaced with SHA-256.
  3. Self-contradicting audit claim: code said "immutable audit" but used mode="w" (overwrites). Fixed with append-only per-run files.
  4. Silent reject masquerading as approval: decision = "comment_only" when the env var was unset; in LIVE mode that meant unapproved comments.
  5. Vocabulary drift: policy listed request_review with no handler; comment vs comment_only inconsistency.
  6. Aborted runs logged as completed: terminal events were not mutually exclusive.
  7. Terminal events outside the audit file: WORKFLOW_COMPLETED was logged AFTER save_audit(), so it was never persisted.
  8. Approval-without-pause: operator absence caused silent rejection instead of a durable pause/resume.
  9. Memory-only state until end of run: a crash mid-process lost all progress. Fixed with per-transition atomic checkpoints.
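Bug #2 is easy to reproduce. This small demo (helper name hypothetical) hashes the same string in two fresh interpreter processes:

```python
import os
import subprocess
import sys

# Reproduces bug #2: CPython salts str hashing per process (PYTHONHASHSEED),
# so hash() of the same diff differs on every run and can never be re-verified.
def hash_in_fresh_process(s: str) -> int:
    env = {k: v for k, v in os.environ.items() if k != "PYTHONHASHSEED"}
    out = subprocess.run([sys.executable, "-c", f"print(hash({s!r}))"],
                         env=env, capture_output=True, text=True, check=True)
    return int(out.stdout)
```

On essentially every pair of runs the two values differ, which is why an audit keyed on hash() is worthless as evidence; hashlib.sha256 of the same bytes is identical everywhere, hence the fix.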

Code metrics

v1 ungoverned vs v6 governance-ready

The codebase grew from a 90-line script to a 750-line system, but the value isn't lines of code: it's structured state, real provenance, and reversibility.

Aspect | v1 ungoverned | v6 governance-ready
Lines of code | ~90 | ~750
Public dataclasses | 0 | 6 (DelegationPolicy, ExecutionPlan, ParsedReview, AuditRecord, TriageState, PRStatus / RunStatus enums)
Custom exceptions | 0 | 2 (WorkflowAborted, WorkflowPaused)
Status enums | 0 | 2 enums, 17 distinct states (12 per-PR + 5 run-level)
Persistence | None | Per-transition atomic checkpoints + final immutable audit
Hash algorithm | Salted Python hash() | SHA-256 (audit-grade)
Decision sources | None | PR_AGENT_APPROVAL_<PR_ID> + PR_AGENT_APPROVAL + PR_AGENT_APPROVER

Quantified value

Numbers verbatim from VALUE_ASSESSMENT.md

Computed deterministically via /lib/case-study-roi.ts (6 validator passes, code-modifying blast radius, audit scope, autonomous workflow). The same calculator powers every case study.

Senior-architect time replaced

~149 hours @ $150/hour ≈ ~$22K per agent

Production ROI per agent / year

$80K – $200K (incident prevention + audit prep + rework)

Time to identify the governance gaps

2-4 weeks of senior-architect review WITHOUT Blueprint; ~30 min across 6 validator passes WITH Blueprint

Incidents prevented (range)

5-15 incidents per year of unintended production source-code changes (each costing ~4-40 hours of incident response / rollback)

Compliance audit prep

~80-120 hours / year replaced with one audit query

Why this matters in 2026

Agentic code-review is shipping in major IDEs, without governance

The "agentic AI runs first drafts of the SDLC" trend is shipping in major IDEs. Without governance, the agent's blast radius is the entire repository, any change can be auto-applied and merged. The governance-ready pattern keeps the agent useful (still reviews, suggests, drafts) while ensuring every irreversible action has an auditable, reversible, operator-approved path.

Cross-domain transfer

Same pattern, same MCP, two completely different agents

Content generation in Test #1, code modification here: two completely different agents, the same pattern and the same MCP, and both reach ALIGNED with verifiable public badges. Cross-domain transfer proven.

Related, Pro / Teams

Run this as a Blueprint Readiness Score

The Architect Agent is the same review pattern shown in this case study, applied to your code. Call architect.validate to get a Blueprint Readiness Score (0–100, A–F) per repository, and a regression diff between runs so the next review focuses on what changed.

Sample score card

B
82 / 100

Production-ready

▲ 7

acme/customer-agent

Run your own validation

Paste your agent code or describe your workflow. The validator returns principle-by-principle findings, a readiness status, and a shareable review URL in seconds.