The Architect Agent
Call `architect.validate` to get a Blueprint Readiness Score (0–100, grade A–F) on real code. The Architect Agent reviews your implementation against the 10 Blueprint principles, returns per-principle verdicts with numeric severity and confidence, and persists the run with a full reproducibility envelope so two callers with the same input can verify they got the same answer. Pro and Teams plans only.
Available to Pro and Teams members. The Architect Agent is the Blueprint's authenticated review surface: it processes your code transiently under a strict zero-training policy, hardens the prompt boundary against injection in submitted code, and supports `private_session=true` to skip all server-side logging.
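A minimal sketch of a call, assuming a hypothetical Python wrapper around whatever transport you use to reach the agent. Only `private_session` (above) and `repository` (see trend memory below) are parameter names taken from this page; everything else, including the `code` field and the `BlueprintClient` class, is illustrative.

```python
# Hypothetical wrapper; the real SDK/transport is not specified on this page.
from blueprint_client import BlueprintClient  # assumed helper, not a published package

client = BlueprintClient(api_key="...")  # authenticated surface: Pro/Teams plans only

run = client.call(
    "architect.validate",
    code=open("agent/pipeline.py").read(),  # the implementation to review (assumed field name)
    repository="acme/orders-agent",         # optional: groups runs for trend memory
    private_session=True,                   # skip all server-side logging
)

print(run["readiness"]["score"], run["readiness"]["grade"])
```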
What you get back
Every run returns a structured response with:

1. `assessment`: overall status, summary, confidence, and `code_classification` (autonomous_agentic_workflow vs non_agentic_component, with rationale) so you can see why some principles are marked not_applicable.
2. Per-principle findings: verdict (aligned, mixed, needs_changes, high_risk, not_applicable), `severity_score` 0–100, `confidence` (low/medium/high), `evidence_quality` (sparse/moderate/strong), code-cited evidence, and a recommendation.
3. `readiness`: the Blueprint Readiness Score (0–100), grade (A–F), tier (production_ready / emerging / draft), counts per verdict bucket, whether the grade was capped by a high_risk finding, and the rubric_version.
4. `recommended_examples` with `example_recommendation_status`, so the run still completes if no curated examples match.
5. `processing`: `llm_latency_ms`, `total_latency_ms`, `timeout_budget_seconds`, `dependency_status`.
6. `reproducibility`: model, seed, system_fingerprint, doctrine_fingerprint, prompt_template_fingerprint, reasoning_effort, and `reproducibility_mode='best_effort'`.
7. `persistence_status`: `saved` or `failed`, with run_id / badge_url / review_url only when the durable write succeeded.
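A short sketch of walking that response in Python. The top-level keys (`assessment`, `readiness`, `persistence_status`, and so on) come from the list above; nested key names such as `findings`, `score`, and `principle` are assumptions about the exact shape.

```python
def summarize(run: dict) -> None:
    """Print the headline result and the findings that still need work (nested key names assumed)."""
    readiness = run["readiness"]
    print(f"Score {readiness['score']}/100, grade {readiness['grade']}, tier {readiness['tier']}")

    # code_classification explains why some principles come back not_applicable.
    print("Classified as:", run["assessment"]["code_classification"])

    for finding in run.get("findings", []):  # assumed key for the per-principle findings
        if finding["verdict"] in ("needs_changes", "high_risk"):
            print(finding["principle"], finding["severity_score"], finding["confidence"])

    # The review can complete even if the durable write fails.
    if run["persistence_status"] == "saved":
        print("Review:", run["review_url"])
    else:
        print("Run completed but was not persisted; no run_id, badge_url, or review_url.")
```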
`severity_class` scoring (`production_blocker` vs `hardening_recommended`)
Two `needs_changes` findings can have very different impact. A token-budget cap as defence-in-depth is not the same as an untyped error path that strands a real user. Each finding now carries a `severity_class` orthogonal to the verdict label: `production_blocker` (trust boundary fails, must fix before prod, contributes 0 credit), `hardening_recommended` (trust boundary holds, defence-in-depth note for the next iteration, full credit), `polish` (stylistic, full credit). The headline grade penalises only `production_blocker` and the legacy `high_risk` verdict; `hardening_recommended` and `polish` surface in a separate next-iteration list without dragging the score down. This lets `production_ready` mean trust boundaries hold rather than 100/100 — chasing the perfect score is the perfection-loop trap (Ford & Parsons fitness-function framing). Older runs without severity_class fall back to the legacy verdict + severity_score interpolation and grade exactly as they did before.
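A toy illustration of that split, not the actual rubric: it only shows the bucketing rule described above (blockers cost credit, everything else goes to a next-iteration list), plus the legacy fallback for runs that predate `severity_class`.

```python
def split_findings(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate score-affecting blockers from the next-iteration list (illustrative only)."""
    blockers, next_iteration = [], []
    for f in findings:
        severity_class = f.get("severity_class")
        if severity_class == "production_blocker":
            blockers.append(f)                 # zero credit, blocks production_ready
        elif severity_class in ("hardening_recommended", "polish"):
            next_iteration.append(f)           # full credit, surfaced separately
        elif severity_class is None and f["verdict"] == "high_risk":
            blockers.append(f)                 # legacy runs: high_risk still caps the grade
        else:
            next_iteration.append(f)
    return blockers, next_iteration
```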
Reproducibility envelope (best-effort, but auditable)
Two callers submitting identical input get the same `seed`, derived from a collision-free JSON canonicalisation that covers every prompt-affecting field. The response carries four fingerprints so any divergence is diagnosable: `system_fingerprint` (OpenAI provider backend), `doctrine_fingerprint` (the principle definitions used), `prompt_template_fingerprint` (system prompt + scaffolding + JSON schema + model + reasoning_effort), and the seed itself. If a future deploy changes the system prompt or the doctrine, the corresponding fingerprint changes — silently breaking determinism is impossible by construction. The mode is explicitly `best_effort`: OpenAI seed gives stable sampling, not byte-identical replay. Per-finding `confidence` lets you tell a real disagreement from intrinsic LLM variance.
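The exact canonicalisation and seed derivation are server-side, but the idea is simple to sketch. Sorted-key JSON plus a SHA-256 digest stands in here for the real collision-free scheme; nothing below is the actual implementation.

```python
import hashlib
import json

def derive_seed(prompt_affecting_fields: dict) -> int:
    """Deterministic seed from the fields that shape the prompt (simplified stand-in)."""
    canonical = json.dumps(
        prompt_affecting_fields,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Identical input -> identical seed. If outputs still diverge, compare the four
# fingerprints in the response to see what changed (backend, doctrine, prompt template).
```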
Trend memory across runs
Pass `repository="<your-repo-name>"` to `architect.validate` and the Architect Agent groups runs by project. Then call `me.validation_history` to see your latest score per repository, the delta versus the previous run, and the exact principles that regressed since last time. Agents can read this context before re-validating to focus the next review on what is actually changing, instead of re-flagging issues you have already fixed.
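A sketch of that loop, reusing the hypothetical client from the first example. The tool names are from this page; the response keys (`repositories`, `latest_score`, `delta`, `regressed_principles`) and the `load_current_code` helper are assumptions.

```python
history = client.call("me.validation_history")

for repo in history.get("repositories", []):             # assumed response shape
    print(repo["repository"], repo["latest_score"], repo["delta"])

    if repo.get("regressed_principles"):
        # Only re-validate when something actually moved, and feed the regressed
        # principles into your own context so the next review focuses on them.
        client.call(
            "architect.validate",
            code=load_current_code(repo["repository"]),   # hypothetical helper
            repository=repo["repository"],
        )
```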
Sample score card
What you see after every `architect.validate` run on a repository.
This is a sample; real score cards are generated by `architect.validate` and visible in /app/readiness-review/history (Pro/Teams).
Also in this section