
The Architect Agent

Call architect.validate to get a Blueprint Readiness Score (0–100, grade A–F) on real code. The Architect Agent reviews your implementation against the 10 Blueprint principles, returns per-principle verdicts with numeric severity and confidence, and persists the run with a full reproducibility envelope so two callers with the same input can verify they got the same answer. Pro and Teams plans only.

Pro and Teams members only. The Architect Agent is the Blueprint's authenticated review surface. It processes your code transiently under a strict zero-training policy, hardens the prompt boundary against injection in submitted code, and supports private_session=true to skip all server-side logging.

What you get back

Every run returns a structured response with:

  • `assessment` — overall status, summary, confidence, and `code_classification` (autonomous_agentic_workflow vs non_agentic_component, with rationale), so you can see why some principles are marked not_applicable.
  • per-principle findings — verdict (aligned, mixed, needs_changes, high_risk, not_applicable), `severity_score` 0–100, `confidence` (low/medium/high), `evidence_quality` (sparse/moderate/strong), code-cited evidence, and a recommendation.
  • `readiness` — the Blueprint Readiness Score (0–100), grade (A–F), tier (production_ready / emerging / draft), counts per verdict bucket, whether the grade was capped by a high_risk finding, and the rubric_version.
  • `recommended_examples` — with `example_recommendation_status`, so the run still completes even if no curated examples match.
  • `processing` — `llm_latency_ms`, `total_latency_ms`, `timeout_budget_seconds`, `dependency_status`.
  • `reproducibility` — model, seed, system_fingerprint, doctrine_fingerprint, prompt_template_fingerprint, reasoning_effort, and `reproducibility_mode='best_effort'`.
  • `persistence_status` — `saved` or `failed`, with run_id / badge_url / review_url only when the durable write succeeded.
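A minimal sketch of reading the documented fields from a run. The field names follow the list above; the exact nesting and the sample values are assumptions for illustration, not the authoritative payload shape.

```python
# Hypothetical architect.validate response; field names are from the docs,
# but the nesting shown here is an illustrative assumption.
response = {
    "assessment": {
        "status": "needs_changes",
        "confidence": "high",
        "code_classification": "autonomous_agentic_workflow",
    },
    "readiness": {"score": 82, "grade": "B", "tier": "production_ready"},
    "reproducibility": {"reproducibility_mode": "best_effort", "seed": 123456},
    "persistence_status": "saved",
}

readiness = response["readiness"]
print(f"{readiness['grade']} ({readiness['score']}/100, {readiness['tier']})")

# Only trust run_id / badge_url / review_url when the durable write succeeded.
persisted = response["persistence_status"] == "saved"
```

The `persistence_status` check matters: per the spec above, the run-level URLs are only present when the write was durable.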

Severity_class scoring (production_blocker vs hardening_recommended)

Two `needs_changes` findings can have very different impact. A token-budget cap as defence-in-depth is not the same as an untyped error path that strands a real user. Each finding now carries a `severity_class` orthogonal to the verdict label: `production_blocker` (trust boundary fails, must fix before prod, contributes 0 credit), `hardening_recommended` (trust boundary holds, defence-in-depth note for the next iteration, full credit), `polish` (stylistic, full credit). The headline grade penalises only `production_blocker` and the legacy `high_risk` verdict; `hardening_recommended` and `polish` surface in a separate next-iteration list without dragging the score down. This lets `production_ready` mean trust boundaries hold rather than 100/100 — chasing the perfect score is the perfection-loop trap (Ford & Parsons fitness-function framing). Older runs without severity_class fall back to the legacy verdict + severity_score interpolation and grade exactly as they did before.
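The triage this implies can be sketched as a simple split, assuming a findings list carrying the documented `verdict` and `severity_class` fields (the principle IDs and sample findings here are invented for illustration):

```python
# Hypothetical findings from one run; only the field names come from the docs.
findings = [
    {"principle": "P2", "verdict": "needs_changes", "severity_class": "production_blocker"},
    {"principle": "P5", "verdict": "needs_changes", "severity_class": "hardening_recommended"},
    {"principle": "P7", "verdict": "mixed", "severity_class": "polish"},
]

# Only production_blocker (and the legacy high_risk verdict) should gate a
# release; everything else is next-iteration material, not a deficit.
blockers = [
    f for f in findings
    if f["severity_class"] == "production_blocker" or f["verdict"] == "high_risk"
]
next_iteration = [f for f in findings if f not in blockers]
```

Note how two `needs_changes` verdicts land in different buckets: the severity_class, not the verdict label, decides what blocks production.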

Reproducibility envelope (best-effort, but auditable)

Two callers submitting identical input get the same `seed`, derived from a collision-free JSON canonicalisation that covers every prompt-affecting field. The response carries four fingerprints so any divergence is diagnosable: `system_fingerprint` (OpenAI provider backend), `doctrine_fingerprint` (the principle definitions used), `prompt_template_fingerprint` (system prompt + scaffolding + JSON schema + model + reasoning_effort), and the seed itself. If a future deploy changes the system prompt or the doctrine, the corresponding fingerprint changes — silently breaking determinism is impossible by construction. The mode is explicitly `best_effort`: OpenAI seed gives stable sampling, not byte-identical replay. Per-finding `confidence` lets you tell a real disagreement from intrinsic LLM variance.
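One way such a seed derivation could look, as a sketch only: the real service's canonicalisation and hash are not specified here, but canonical JSON plus a cryptographic digest gives the stated property that identical prompt-affecting input yields an identical seed.

```python
import hashlib
import json

def derive_seed(prompt_affecting_fields: dict) -> int:
    """Map identical inputs to an identical seed via canonical JSON + SHA-256.

    Illustrative only -- the production canonicalisation may differ.
    """
    canonical = json.dumps(prompt_affecting_fields, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # Fold the digest into a 32-bit value suitable for an LLM seed parameter.
    return int.from_bytes(digest[:4], "big")

# Key order must not matter: same fields, same seed.
a = derive_seed({"code": "def f(): ...", "model": "gpt-x"})
b = derive_seed({"model": "gpt-x", "code": "def f(): ..."})
assert a == b
```

Sorting keys before serialising is what makes the canonicalisation collision-free across callers who build the same request in different orders.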

Trend memory across runs

Pass repository="<your-repo-name>" to architect.validate and the Architect Agent groups runs by project. Then call me.validation_history to see your latest score per repository, the delta versus the previous run, and the exact principles that regressed since last time. Agents can read this context before re-validating to focus the next review on what is actually changing, instead of re-flagging issues you have already fixed.
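The regressed/improved lists in the history view amount to a set diff over per-principle verdicts between two runs. A sketch under that assumption, using the verdict labels from the response spec (the history payload shape itself is invented for illustration):

```python
# Hypothetical per-principle verdicts from two consecutive runs.
previous = {"P3": "needs_changes", "P5": "aligned", "P8": "mixed"}
current = {"P3": "aligned", "P5": "needs_changes", "P8": "aligned"}

aligned_before = {p for p, v in previous.items() if v == "aligned"}
aligned_now = {p for p, v in current.items() if v == "aligned"}

regressed = sorted(aligned_before - aligned_now)  # aligned -> not aligned
improved = sorted(aligned_now - aligned_before)   # not aligned -> aligned
print(regressed, improved)  # ['P5'] ['P3', 'P8']
```

An agent reading this before re-validating can scope the next review to `regressed` instead of re-flagging everything in `improved`.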

Sample score card

What you see after every architect.validate run on a repository.

Repository

acme/customer-agent

Grade

B

Score (0–100)

82 / 100

Tier

Production-ready

Change vs. previous run

+7

Aligned principles this run

9 of 12 (75%)

Regressed since last run

Principles that flipped from aligned → not aligned. Focus the next review here.

  • Make hand-offs and approvals explicit (P5)

Improved since last run

Principles that flipped from not aligned → aligned. Confirm what changed and don't re-flag.

  • Surface meaningful operational state (P3)
  • Constrain the agent to recoverable actions (P8)

Architect Agent guidance

Score improved by 7 since the previous run on this repository. Confirm what changed before suggesting further work, and focus the next review on the regressed principle (P5) instead of re-checking principles that are already aligned.

How the grade is computed

Formula

score = round(100 × Σ credit_per_finding / applicable_principles)

Each finding contributes credit based on its severity_class, not its verdict alone. production_blocker = 0 (must fix before prod). hardening_recommended = 1.0 (trust boundary holds; defence-in-depth note for the next iteration). polish = 1.0 (stylistic, non-load-bearing). aligned = 1.0. The legacy high_risk verdict still caps the grade at C. production_ready means trust boundaries hold, not 100/100 — chasing the perfect score is the perfection-loop trap (Ford & Parsons). Older runs without severity_class fall back to the legacy verdict-only formula and score exactly as they did before the refinement.

Grade bands

  • A · 90+ · Production-ready
  • B · 75+ · Production-ready
  • C · 60+ · Emerging
  • D · 40+ · Draft
  • F · 0+ · Draft
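Combining the formula and the bands above into one sketch: credit values, thresholds, tiers, and the high_risk cap at C are all as documented; the function signature and input shape are assumptions for illustration.

```python
# Credit per finding, keyed by severity_class (or "aligned"), as documented.
CREDIT = {"production_blocker": 0.0, "hardening_recommended": 1.0,
          "polish": 1.0, "aligned": 1.0}

def readiness(findings):
    """findings: (severity_class_or_'aligned', verdict) pairs for applicable
    principles only (not_applicable excluded). Illustrative, not canonical."""
    credit = sum(CREDIT[sc] for sc, _ in findings)
    score = round(100 * credit / len(findings))
    for grade, floor in (("A", 90), ("B", 75), ("C", 60), ("D", 40), ("F", 0)):
        if score >= floor:
            break
    # The legacy high_risk verdict still caps the grade at C.
    if any(v == "high_risk" for _, v in findings) and grade in ("A", "B"):
        grade = "C"
    tier = {"A": "production_ready", "B": "production_ready",
            "C": "emerging", "D": "draft", "F": "draft"}[grade]
    return score, grade, tier

# Five principles with full credit plus one production_blocker:
print(readiness([("aligned", "aligned")] * 5
                + [("production_blocker", "needs_changes")]))
# -> (83, 'B', 'production_ready')
```

Note that a single blocker among six applicable principles still lands in the B band: the grade penalises the blocker's zero credit, while hardening and polish findings would not have moved the score at all.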

This is a sample. Real score cards are generated by architect.validate and visible to you in /app/readiness-review/history (Pro/Teams).

Live self-review on prod
The architect ran on its own code
We pointed architect.validate at one of its own load-bearing features on the live prod endpoint. The badge below is the live SVG served from /api/badge/run/{run_id}/card.svg — same surface any Pro/Teams caller gets after their own validate run. Click through to read every per-principle finding.
Live self-review badge: production_ready, B, 86 / 100
production_ready · B · 86/100 · 5 of 6 principles aligned · No high_risk findings

Honest signal: the architect found one principle still worth tightening, and that finding is exactly what production_ready means under the doctrine. production_ready is not a 100/100 stamp — it means trust boundaries hold. Hardening recommendations are iteration material, not a deficit.

Read every per-principle finding