The Architect Agent
Call architect.validate to get a Blueprint Readiness Score (0–100, grade A–F) on real code. The Architect Agent reviews your implementation against the 10 Blueprint principles, returns per-principle verdicts with numeric severity and confidence, and persists the run with a full reproducibility envelope so that two callers with the same input can verify they got the same answer. Pro and Teams plans only.
The Architect Agent is the Blueprint's authenticated review surface. It reviews your code under a strict zero-training policy, hardens the prompt boundary against injection in submitted code, and supports private_session=true to skip all server-side logging.
What you get back
Every run returns a structured response with seven blocks:
assessment: Overall status, summary, confidence, and code_classification (autonomous_agentic_workflow vs non_agentic_component, with rationale) so you can see why some principles are marked not_applicable.
findings[]: Per-principle verdict (aligned, mixed, needs_changes, high_risk, not_applicable), severity_score 0–100, confidence (low/medium/high), evidence_quality (sparse/moderate/strong), code-cited evidence, and a recommendation.
readiness: The Blueprint Readiness Score (0–100), grade (A–F), tier (production_ready / emerging / draft), counts per verdict bucket, whether the grade was capped by a high_risk finding, and the rubric_version.
recommended_examples: Carries example_recommendation_status so the run still completes if no curated examples match.
processing: llm_latency_ms, total_latency_ms, timeout_budget_seconds, dependency_status.
reproducibility: model, seed, system_fingerprint, doctrine_fingerprint, prompt_template_fingerprint, reasoning_effort, and reproducibility_mode='best_effort'.
persistence_status: saved or failed, with run_id / badge_url / review_url surfaced only when the durable write succeeded.
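As a rough illustration of consuming these blocks, the sketch below tallies verdicts and only reads run_id when the run was saved. It assumes the response has already been parsed into a dict. The "verdict" key on each finding and the nested "status" key under persistence_status are assumptions about the wire format, not confirmed field names, and summarize is a hypothetical helper.

```python
from collections import Counter
from typing import Any

# Verdict labels as listed for findings[] above.
VERDICTS = ("aligned", "mixed", "needs_changes", "high_risk", "not_applicable")


def summarize(response: dict[str, Any]) -> dict[str, Any]:
    """Condense a parsed architect.validate response into headline fields."""
    # Assumption: each finding stores its verdict label under "verdict".
    counts = Counter(f["verdict"] for f in response["findings"])

    # Assumption: persistence_status is an object with a "status" field
    # ("saved" or "failed") next to run_id / badge_url / review_url.
    persistence = response["persistence_status"]
    saved = persistence.get("status") == "saved"

    return {
        "grade": response["readiness"]["grade"],
        "tier": response["readiness"]["tier"],
        "verdict_counts": {v: counts.get(v, 0) for v in VERDICTS},
        # Only trust run_id when the durable write succeeded.
        "run_id": persistence.get("run_id") if saved else None,
    }
```

Guarding on the saved state before reading run_id mirrors the contract described above: identifiers are surfaced only when the durable write succeeded.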
severity_class scoring (production_blocker vs hardening_recommended)
Two needs_changes findings can have very different impact. A token-budget cap as defence-in-depth is not the same as an untyped error path that strands a real user. Each finding now carries a severity_class orthogonal to the verdict label:
production_blocker: Trust boundary fails. Must fix before prod. Contributes 0 credit to the score.
hardening_recommended: Trust boundary holds. Defence-in-depth note for the next iteration. Full credit.
polish: Stylistic, non-load-bearing. Full credit.
The headline grade penalises only production_blocker findings and the legacy high_risk verdict. hardening_recommended and polish findings surface in a separate next-iteration list without dragging the score down. This lets production_ready mean "trust boundaries hold" rather than "100/100". Older runs without severity_class fall back to the legacy verdict + severity_score interpolation and grade exactly as they did before.
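The split between score-affecting findings and the next-iteration list can be sketched on the client side as follows. This is a minimal illustration, not the service's scoring code. split_findings is a hypothetical helper, the "verdict" key is an assumed field name, and the legacy interpolation for runs without severity_class is deliberately left out because its formula is not described here.

```python
from typing import Any

# Classes that receive full credit and only feed the next-iteration list.
NEXT_ITERATION_CLASSES = ("hardening_recommended", "polish")


def split_findings(
    findings: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (penalised, next_iteration) lists from a findings[] array."""
    penalised: list[dict[str, Any]] = []
    next_iteration: list[dict[str, Any]] = []
    for finding in findings:
        severity_class = finding.get("severity_class")
        # Assumption: the verdict label lives under "verdict".
        verdict = finding.get("verdict")
        if severity_class == "production_blocker" or verdict == "high_risk":
            # Only these two cases pull the headline grade down.
            penalised.append(finding)
        elif severity_class in NEXT_ITERATION_CLASSES:
            next_iteration.append(finding)
        # Anything else (aligned, not_applicable, legacy runs without
        # severity_class) is left for the caller to handle.
    return penalised, next_iteration
```

A caller might block a release when the first list is non-empty and file the second list as follow-up work.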
Honest score, honest uncertainty
The Blueprint Readiness Score reflects what the Architect Agent is confident about, and what it isn't. When the architect is genuinely uncertain on a principle (verdicts that could flip on a re-run), you see that uncertainty surface next to the score as a stability signal, not buried inside one number. The certified production_ready badge is reserved for runs where the architect's read is confident across every principle, not just lucky on a single shot. So a single high-scoring run is not enough to mint the badge. The architect must agree with itself on independent re-evaluation. The variance you would otherwise have to discover by re-running yourself surfaces up front, in the same response.
Reproducibility envelope (best-effort, but auditable)
Two callers submitting identical input get the same seed, derived from a collision-free JSON canonicalisation that covers every prompt-affecting field. The response carries four fingerprints so any divergence is diagnosable:
system_fingerprint: OpenAI provider backend identifier.
doctrine_fingerprint: The principle definitions used for this run.
prompt_template_fingerprint: system prompt + scaffolding + JSON schema + model + reasoning_effort, hashed together.
seed: The deterministic sampling seed itself.
If a future deploy changes the system prompt or the doctrine, the corresponding fingerprint changes. Silently breaking determinism is impossible by construction. The mode is explicitly best_effort: OpenAI seed gives stable sampling, not byte-identical replay. Per-finding confidence lets you tell a real disagreement from intrinsic LLM variance.
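To make the idea concrete, here is a small sketch of deriving a seed from canonical JSON and of diagnosing which envelope field diverged between two runs. The actual canonicalisation and hashing used by the service are not specified in this section. The sorted-keys encoding, the SHA-256 hash, the 32-bit truncation, and the helper names are all assumptions chosen for illustration.

```python
import hashlib
import json
from typing import Any

# Envelope fields named in the list above.
ENVELOPE_KEYS = (
    "system_fingerprint",
    "doctrine_fingerprint",
    "prompt_template_fingerprint",
    "seed",
)


def canonical_json(payload: Any) -> str:
    """Serialise so that key order and whitespace cannot change the output."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def derive_seed(payload: Any) -> int:
    """Hash the canonical form and keep 32 bits as an integer seed (assumed width)."""
    digest = hashlib.sha256(canonical_json(payload).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def diverging_fields(run_a: dict[str, Any], run_b: dict[str, Any]) -> list[str]:
    """List the envelope fields that differ between two reproducibility blocks."""
    return [key for key in ENVELOPE_KEYS if run_a.get(key) != run_b.get(key)]
```

If two runs on identical input disagree, diverging_fields on their reproducibility blocks shows whether the doctrine, the prompt template, or the provider backend changed. An empty list points to intrinsic sampling variance instead.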
This is a sample. Real score cards are generated by architect.validate and visible to you in /app/readiness-review/history (Pro/Teams).
The architect grades its own code, and we publish every run
Each run is a deliberate, full architect.validate self-review of the validate-handler scope, followed by architect.certify. The certified production_ready badge lands here when the run earns it. This is the same loop the skill above teaches, applied to ourselves.