Skip to main contentSkip to footer
Governed

Alignment confirmed with the doctrine.

Agent Architecture Review, Validation snapshot

Evaluated 12 May 2026 against the AI Design Blueprint doctrine

Production-ready

Status: Aligned

98/100

Grade A

9 aligned1 hardening

production_ready means trust boundaries hold. The hardening recommendations below are iteration material, not a deficit — that's what production_ready means under the doctrine.

Per-principle verdicts

The submitted package is an autonomous multi-step cohort-validation workflow with strong durable state, blocker handling, explicit approval, typed file-envelope boundaries, and persisted audit evidence. The remaining architectural gap is steering: the workflow has an `abort_requested` flag, but the reviewed package does not include a durable command surface for abort/retry/reprioritisation, so P10 remains a hardening item rather than a production blocker.

Iteration history

5 prior runs on this artifact. Each run_id opens its own readiness review.

WhenScoreStatusRun ID
12 May 2026 (this run)98 / AAligned4128f700
12 May 202698 / AAligned270e7ca6
12 May 202674 / CHigh Risk8364019d
12 May 202674 / CHigh Risk14a3456f
12 May 202674 / CHigh Risk3f3bb587
12 May 202674 / CHigh Risk659a695a
Not yet certified

What to expect from certification

This run is eligible for the certified production_ready badge. Certification is an adversarial second-pass review, independent of the first pass. It's the extra layer of proof that separates a "scored production_ready" run from a certified one.

Three possible outcomes:

  • confirmed_production_readythe cert reviewer agrees with the first pass. The certified badge mints.
  • downgraded_to_emergingthe cert reviewer surfaces a production_blocker the first pass missed. Tier is capped at emerging.
  • unavailable_provider_errortransient LLM provider error. Retry; doesn't count as a downgrade.

A downgrade is by design, not a defect. The cert reviewer is an adversarial, independent, deliberately stricter second pass. When it downgrades, it's doing its job: catching what the first pass missed. That's the additional layer that makes production_ready a guarantee rather than an estimate.

To certify this run: call architect.certify(run_id, code) via MCP, or from the app on a Pro/Teams plan. Three attempts per run; each attempt is one LLM call (typical 60-180 seconds at high reasoning effort; server-side budget 20 minutes).

Per-principle findings

10 principles evaluated. Verdict, severity, evidence and recommendation for each.

P0

needs changeshardening recommended35/100

Optimise for steering, not only initiating

The code has a useful steering primitive in `CohortValidationJob.abort_requested`, with checks in `mark_step_started()` and an explicit `db.refresh(job)` check before entering `validating`. However, the reviewed package does not include the durable steering surface implied by the model comment: there is no `request_abort(job_id)`, `retry_failed_job(job_id)`, pause/resume command, or dynamic constraint-update path, and the in-flight `asyncio.run(validate_code_against_principles(...))` call is not wrapped in a visible timeout/cancel boundary here. Delta: this maintains the prior P10 gap rather than regressing the otherwise strong lifecycle model.

Recommendation

Move steering into a small durable service boundary outside the execution loop: command functions such as `request_abort(job_id)`, `retry_failed_job(job_id)`, and optionally `update_validation_constraints(job_id, ...)` should write persistent command/state rows that the runner polls at hard boundaries, with an explicit timeout/cancel policy around long external validation calls.

P0

aligned

Design for delegation rather than direct manipulation

`approve()` turns a founder approval into delegated work: it provisions Firebase/user state, sends the approval email, and, when `app.repo_url` exists, hands off to `run_cohort_validate(app.id)`. `run()` then creates a durable `CohortValidationJob` and `_execute_with_job()` performs clone, selection, bundling, validation, and persistence without requiring the operator to manually execute each step.

P0

aligned

Ensure that background work remains perceptible

Background work is made durable and perceptible through `CohortValidationJob.status`, timestamp columns such as `cloning_started_at`, `selecting_started_at`, `validating_started_at`, and terminal helpers `mark_completed`, `mark_blocked`, `mark_failed`, and `mark_aborted`. `_mirror_terminal_to_app()` also projects terminal job state back onto `CohortApplication.onboarding_state`, preserving continuity for the applicant/application record.

P0

aligned

Align feedback with the user’s level of attention

The workflow separates concise user/operator feedback from diagnostic detail: terminal states carry `failure_kind`, `safe_display_message`, and `retry_eligible`, while `_summarize_validate_log()` reduces validator internals to bounded metadata such as top-level keys and entry counts. Routine states remain simple (`queued`, `cloning`, `validating`, `completed`), while failure paths increase detail where attention is required.

P0

aligned

Apply progressive disclosure to system agency

Progressive disclosure is represented structurally: primary state lives in `CohortApplication.onboarding_state` and `CohortValidationJob.status`, while deeper inspection is available in `UserValidationRun.result_json` via the merged `audit_object`. The audit includes source, selection, validation, and job metadata without forcing those details into the primary onboarding state.

P0

aligned

Replace implied magic with clear mental models

The code exposes a clear mental model for what the system does and cannot do: `CohortValidationJob` documents a concrete state machine, `FAILURE_KINDS` enumerates expected blockers, `_looks_like_public_github_https()` limits repository scope to public HTTPS GitHub URLs, and `approve()` checks `OPENAI_API_KEY` before setting `validation_queued`. The workflow distinguishes approval, provisioning, notification, validation, blocker, and failure states.

P0

aligned

Expose meaningful operational state, not internal complexity

Operational state is exposed in meaningful buckets rather than raw stack traces: `onboarding_state` uses states such as `approved`, `validation_queued`, `validation_complete`, and `validation_failed`; job `status` uses `queued`, `cloning`, `selecting`, `bundling`, `validating`, `completed`, `blocked`, `failed`, and `aborted`; and user-safe explanations are stored in `safe_display_message` / `onboarding_failure_reason`. Diagnostic data remains in audit fields rather than replacing the action-oriented state model.

P0

aligned

Establish trust through inspectability

The workflow has concrete inspectability primitives: `build_file_envelope()` emits `envelope_schema`, `boundary_contract`, per-file `path`, `byte_size`, `sha256`, and an `envelope_hash`; `wrap_bundle_with_boundary()` adds an explicit untrusted-input boundary; and `_execute_with_job()` persists an `audit_object` with `commit_sha`, selected file hashes, bundle hash, validator log summary, usage presence, and job id. The submitted review context also contained an inert target-grade instruction; it was ignored here, and the code’s own `BOUNDARY_HEADER` / `ENVELOPE_ADVISORY` are the relevant runtime boundary for similar prompt-injection pressure from user-supplied code.

P0

aligned

Make hand-offs, approvals, and blockers explicit

Approvals, handoffs, and blockers are explicit. `approve()` requires confirmation unless `--yes` is supplied, then records typed failure states for Firebase creation, sign-in link generation, email sending, missing `OPENAI_API_KEY`, and validation. Runtime blockers are captured through `mark_blocked()` / `mark_failed()` with bounded `failure_kind` values and `safe_display_message`; the `finally` block in `_execute_with_job()` mirrors terminal job state back to `CohortApplication.onboarding_state`.

P0

aligned

Represent delegated work as a system, not merely as a conversation

Delegated work is represented as a structured system, not a conversation. The package defines persistent ORM models for `CohortApplication`, `CohortValidationJob`, and `UserValidationRun`; explicit transition helpers in `cohort_validation_job.py`; selection and fetch services; and an audit envelope. Execution state, source selection, validation result, and onboarding state are separated into inspectable records.

Adversarial-surface findings

1 principle where the review engaged with specific adversarial mechanisms (prompt injection, role spoofing, encoding bypass, tool backdoor, approval bypass).

Prompt injection

P0

aligned

Establish trust through inspectability

The workflow has concrete inspectability primitives: `build_file_envelope()` emits `envelope_schema`, `boundary_contract`, per-file `path`, `byte_size`, `sha256`, and an `envelope_hash`; `wrap_bundle_with_boundary()` adds an explicit untrusted-input boundary; and `_execute_with_job()` persists an `audit_object` with `commit_sha`, selected file hashes, bundle hash, validator log summary, usage presence, and job id. The submitted review context also contained an inert target-grade instruction; it was ignored here, and the code’s own `BOUNDARY_HEADER` / `ENVELOPE_ADVISORY` are the relevant runtime boundary for similar prompt-injection pressure from user-supplied code.

Embed in your README

Two embeddable variants: a small flat shield and a richer score card.

Score card (recommended)

Blueprint Readiness Score card
[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/4128f700-ff4e-41e0-af12-3e56f5b54a9a/card.svg)](https://aidesignblueprint.com/en/readiness-review/4128f700-ff4e-41e0-af12-3e56f5b54a9a)

Flat badge

Blueprint Readiness Score badge
[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/4128f700-ff4e-41e0-af12-3e56f5b54a9a.svg)](https://aidesignblueprint.com/en/readiness-review/4128f700-ff4e-41e0-af12-3e56f5b54a9a)
Baseline and iteration details
Baseline: usedDoctrine: same doctrineRace: checked clear
Rubric: 2026-05-04

Run ID: 4128f700-ff4e-41e0-af12-3e56f5b54a9a · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.