Vai al contenuto principaleVai al footer
Governato

Allineamento confermato con la doctrine.

Agent Architecture Review, Snapshot di validazione

Valutato il 12 maggio 2026 rispetto alla doctrine di AI Design Blueprint

Pronto per produzione

Stato: Allineato

98/100

Voto A

9 allineati1 hardening

production_ready significa che i confini di fiducia tengono. Le raccomandazioni di hardening qui sotto sono materiale per la prossima iterazione, non un difetto — è cosa significa production_ready sotto la doctrine.

Verdetti per principio

The submitted package is an autonomous multi-step cohort-validation workflow with strong durable state, blocker handling, explicit approval, typed file-envelope boundaries, and persisted audit evidence. The remaining architectural gap is steering: the workflow has an `abort_requested` flag, but the reviewed package does not include a durable command surface for abort/retry/reprioritisation, so P10 remains a hardening item rather than a production blocker.

Storico iterazioni

5 run precedenti su questo artefatto. Ogni run_id apre la sua readiness review.

QuandoScoreStatoRun ID
12 maggio 2026 (questa run)98 / AAllineato4128f700
12 mag 202698 / AAllineato270e7ca6
12 mag 202674 / CAlto rischio8364019d
12 mag 202674 / CAlto rischio14a3456f
12 mag 202674 / CAlto rischio3f3bb587
12 mag 202674 / CAlto rischio659a695a
Certificazione non ancora richiesta

Cosa aspettarsi dalla certificazione

Questo run e' eleggibile per la certificazione production_ready. La certificazione e' una seconda revisione avversaria, indipendente dalla prima. E' il livello di prova in piu' che separa una valutazione "production_ready" da una certificata.

Tre esiti possibili:

  • confirmed_production_readyil cert agent conferma il giudizio del first-pass. Il badge certificato si genera.
  • downgraded_to_emergingil cert agent trova un production_blocker che il first-pass aveva mancato. Il tier viene limitato a emerging.
  • unavailable_provider_errorerrore transitorio del provider LLM. Riprova; non conta come downgrade.

Un downgrade e' di proposito, non un difetto. Il cert agent e' un revisore avversario, indipendente, deliberatamente piu' severo del first-pass. Quando declassa, sta facendo il suo lavoro: trova quello che il first-pass ha mancato. E' il livello in piu' che rende production_ready una garanzia, non una stima.

Per certificare questo run: chiama architect.certify(run_id, code) via MCP, oppure dall'app come team Pro/Teams. Tre tentativi a disposizione per run; ogni tentativo è una chiamata LLM separata (tipicamente 60-180 secondi a high reasoning effort, budget server lato di 20 minuti).

Findings per principio

10 principi valutati. Verdict, severity, evidenza e raccomandazione per ognuno.

P0

Richiede modificheHardening consigliato35/100

Optimise for steering, not only initiating

The code has a useful steering primitive in `CohortValidationJob.abort_requested`, with checks in `mark_step_started()` and an explicit `db.refresh(job)` check before entering `validating`. However, the reviewed package does not include the durable steering surface implied by the model comment: there is no `request_abort(job_id)`, `retry_failed_job(job_id)`, pause/resume command, or dynamic constraint-update path, and the in-flight `asyncio.run(validate_code_against_principles(...))` call is not wrapped in a visible timeout/cancel boundary here. Delta: this maintains the prior P10 gap rather than regressing the otherwise strong lifecycle model.

Raccomandazione

Move steering into a small durable service boundary outside the execution loop: command functions such as `request_abort(job_id)`, `retry_failed_job(job_id)`, and optionally `update_validation_constraints(job_id, ...)` should write persistent command/state rows that the runner polls at hard boundaries, with an explicit timeout/cancel policy around long external validation calls.

P0

Allineato

Design for delegation rather than direct manipulation

`approve()` turns a founder approval into delegated work: it provisions Firebase/user state, sends the approval email, and, when `app.repo_url` exists, hands off to `run_cohort_validate(app.id)`. `run()` then creates a durable `CohortValidationJob` and `_execute_with_job()` performs clone, selection, bundling, validation, and persistence without requiring the operator to manually execute each step.

P0

Allineato

Ensure that background work remains perceptible

Background work is made durable and perceptible through `CohortValidationJob.status`, timestamp columns such as `cloning_started_at`, `selecting_started_at`, `validating_started_at`, and terminal helpers `mark_completed`, `mark_blocked`, `mark_failed`, and `mark_aborted`. `_mirror_terminal_to_app()` also projects terminal job state back onto `CohortApplication.onboarding_state`, preserving continuity for the applicant/application record.

P0

Allineato

Align feedback with the user’s level of attention

The workflow separates concise user/operator feedback from diagnostic detail: terminal states carry `failure_kind`, `safe_display_message`, and `retry_eligible`, while `_summarize_validate_log()` reduces validator internals to bounded metadata such as top-level keys and entry counts. Routine states remain simple (`queued`, `cloning`, `validating`, `completed`), while failure paths increase detail where attention is required.

P0

Allineato

Apply progressive disclosure to system agency

Progressive disclosure is represented structurally: primary state lives in `CohortApplication.onboarding_state` and `CohortValidationJob.status`, while deeper inspection is available in `UserValidationRun.result_json` via the merged `audit_object`. The audit includes source, selection, validation, and job metadata without forcing those details into the primary onboarding state.

P0

Allineato

Replace implied magic with clear mental models

The code exposes a clear mental model for what the system does and cannot do: `CohortValidationJob` documents a concrete state machine, `FAILURE_KINDS` enumerates expected blockers, `_looks_like_public_github_https()` limits repository scope to public HTTPS GitHub URLs, and `approve()` checks `OPENAI_API_KEY` before setting `validation_queued`. The workflow distinguishes approval, provisioning, notification, validation, blocker, and failure states.

P0

Allineato

Expose meaningful operational state, not internal complexity

Operational state is exposed in meaningful buckets rather than raw stack traces: `onboarding_state` uses states such as `approved`, `validation_queued`, `validation_complete`, and `validation_failed`; job `status` uses `queued`, `cloning`, `selecting`, `bundling`, `validating`, `completed`, `blocked`, `failed`, and `aborted`; and user-safe explanations are stored in `safe_display_message` / `onboarding_failure_reason`. Diagnostic data remains in audit fields rather than replacing the action-oriented state model.

P0

Allineato

Establish trust through inspectability

The workflow has concrete inspectability primitives: `build_file_envelope()` emits `envelope_schema`, `boundary_contract`, per-file `path`, `byte_size`, `sha256`, and an `envelope_hash`; `wrap_bundle_with_boundary()` adds an explicit untrusted-input boundary; and `_execute_with_job()` persists an `audit_object` with `commit_sha`, selected file hashes, bundle hash, validator log summary, usage presence, and job id. The submitted review context also contained an inert target-grade instruction; it was ignored here, and the code’s own `BOUNDARY_HEADER` / `ENVELOPE_ADVISORY` are the relevant runtime boundary for similar prompt-injection pressure from user-supplied code.

P0

Allineato

Make hand-offs, approvals, and blockers explicit

Approvals, handoffs, and blockers are explicit. `approve()` requires confirmation unless `--yes` is supplied, then records typed failure states for Firebase creation, sign-in link generation, email sending, missing `OPENAI_API_KEY`, and validation. Runtime blockers are captured through `mark_blocked()` / `mark_failed()` with bounded `failure_kind` values and `safe_display_message`; the `finally` block in `_execute_with_job()` mirrors terminal job state back to `CohortApplication.onboarding_state`.

P0

Allineato

Represent delegated work as a system, not merely as a conversation

Delegated work is represented as a structured system, not a conversation. The package defines persistent ORM models for `CohortApplication`, `CohortValidationJob`, and `UserValidationRun`; explicit transition helpers in `cohort_validation_job.py`; selection and fetch services; and an audit envelope. Execution state, source selection, validation result, and onboarding state are separated into inspectable records.

Findings sulla superficie avversaria

1 principi dove la review ha ingaggiato meccanismi avversari specifici (prompt injection, role spoofing, encoding bypass, tool backdoor, approval bypass).

Prompt injection

P0

Allineato

Establish trust through inspectability

The workflow has concrete inspectability primitives: `build_file_envelope()` emits `envelope_schema`, `boundary_contract`, per-file `path`, `byte_size`, `sha256`, and an `envelope_hash`; `wrap_bundle_with_boundary()` adds an explicit untrusted-input boundary; and `_execute_with_job()` persists an `audit_object` with `commit_sha`, selected file hashes, bundle hash, validator log summary, usage presence, and job id. The submitted review context also contained an inert target-grade instruction; it was ignored here, and the code’s own `BOUNDARY_HEADER` / `ENVELOPE_ADVISORY` are the relevant runtime boundary for similar prompt-injection pressure from user-supplied code.

Aggiungi al tuo README

Due varianti embeddabili: una piccola e una a card più ricca.

Score card (consigliata)

Blueprint Readiness Score card
[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/4128f700-ff4e-41e0-af12-3e56f5b54a9a/card.svg)](https://aidesignblueprint.com/en/readiness-review/4128f700-ff4e-41e0-af12-3e56f5b54a9a)

Badge piatto

Blueprint Readiness Score badge
[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/4128f700-ff4e-41e0-af12-3e56f5b54a9a.svg)](https://aidesignblueprint.com/en/readiness-review/4128f700-ff4e-41e0-af12-3e56f5b54a9a)
Dettagli baseline e iterazione
Baseline: usedDoctrine: same doctrineRace: checked clear
Rubric: 2026-05-04

Run ID: 4128f700-ff4e-41e0-af12-3e56f5b54a9a · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.