Reviewed

Assessment complete; awaiting evidence review.

Agent Architecture Review, Validation Snapshot

Assessed on 23 May 2026 against the AI Design Blueprint doctrine

Production ready

Status: Aligned

98/100

Grade A

9 aligned · 1 hardening

production_ready means that the trust boundaries hold. The hardening recommendations below are material for the next iteration, not a defect: that is what production_ready means under the doctrine.

Blueprint Readiness measures alignment with the doctrine, not runtime correctness. A production-ready verdict means the architecture embodies the 10 principles; it does not run your tests or check your types. Use it alongside your test suite, not in place of it.

Verdicts by principle

The submitted code is an autonomous background validation workflow and the SEP-1686 task adoption materially closes the durable lifecycle gap: `task_id`, `task_state`, history recovery, projected UI status, dedupe signalling, and cancellation are now represented as durable product primitives rather than chat-scrollback state. The main remaining gap is inspectability of how the validation result was produced: the snippets show lifecycle/state persistence, but not a compact provenance or event ledger for model/tool/policy decisions behind `_execute_validate_body()`.

Certification not yet requested

What to expect from certification

This run is eligible for production_ready certification. Certification is a second, adversarial review, independent of the first. It is the extra level of proof that separates a "production_ready" assessment from a certified one.

Three possible outcomes:

  • confirmed_production_ready: the cert agent confirms the first-pass judgement. The certified badge is generated.
  • downgraded_to_emerging: the cert agent finds a production_blocker that the first pass missed. The tier is capped at emerging.
  • unavailable_provider_error: a transient error from the LLM provider. Retry; it does not count as a downgrade.

A downgrade is by design, not a defect. The cert agent is an adversarial, independent reviewer, deliberately stricter than the first pass. When it downgrades, it is doing its job: finding what the first pass missed. This is the extra layer that makes production_ready a guarantee rather than an estimate.

To certify this run, call architect.certify(run_id, code) via MCP, or certify from the app as a Pro/Teams team. Each run allows three attempts; each attempt is a separate LLM call (typically 60-180 seconds at high reasoning effort, with a 20-minute server-side budget).

Findings by principle

10 principles assessed. Verdict, severity, evidence, and recommendation for each.

P7

Changes required · Hardening recommended · 35/100

Establish trust through inspectability

The submission improves lifecycle inspectability with durable `UserValidationRun` rows, `task_id`, `task_state`, `structuredContent=task_result_dict`, and `me.validation_history(run_id)`. However, the snippets do not show a provenance/event ledger for how `_execute_validate_body()` produced the assessment: there is no visible persisted trace of model/policy version, source bundle identity beyond `code_fingerprint`, tool calls, decision checkpoints, or result-generation events. The current primitive makes the task lifecycle inspectable, but not yet the production path of the validation judgement itself.

Recommendation

Persist a compact run provenance ledger keyed by `run_id` outside the execution loop, recording the input/code fingerprint, policy/model version, relevant tool calls, decision checkpoints, and final result hash; link that ledger from `validation_history` while keeping the dashboard projection separate from raw internals.
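The recommendation above could be sketched roughly as follows. This is a minimal illustration, not the submitted code: `ProvenanceEvent`, `ProvenanceLedger`, `result_hash`, the event kinds, and the sample values are all hypothetical names chosen here, and an in-memory dict stands in for whatever durable table the real system would use.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEvent:
    """One append-only ledger entry (hypothetical shape)."""
    run_id: str
    kind: str          # assumed kinds: "input", "tool_call", "checkpoint", "result"
    detail: dict
    recorded_at: str


@dataclass
class ProvenanceLedger:
    """In-memory stand-in for a durable ledger keyed by run_id."""
    _events: dict = field(default_factory=dict)

    def record(self, run_id: str, kind: str, detail: dict) -> ProvenanceEvent:
        event = ProvenanceEvent(
            run_id=run_id,
            kind=kind,
            detail=dict(detail),
            recorded_at=datetime.now(timezone.utc).isoformat(),
        )
        # Append only: earlier events are never rewritten.
        self._events.setdefault(run_id, []).append(event)
        return event

    def for_run(self, run_id: str) -> list:
        """Return the ordered events for one run as plain dicts."""
        return [asdict(e) for e in self._events.get(run_id, [])]


def result_hash(result: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative values only; none of these come from the reviewed run.
ledger = ProvenanceLedger()
ledger.record("run-1", "input", {"code_fingerprint": "abc123", "policy_version": "example-v1"})
ledger.record("run-1", "checkpoint", {"principle": "P7", "verdict": "hardening"})
ledger.record("run-1", "result", {"result_hash": result_hash({"score": 98})})
```

Keeping the ledger in its own structure, separate from the lifecycle row, is one way to let a history view link to it without pulling raw internals into the dashboard projection.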

P1

Aligned

Design for delegation rather than direct manipulation

Delegation is represented explicitly: the handler accepts `params.task: {ttl}`, mints a `run_id`/`task_id` at start, returns `CreateTaskResult` immediately, and performs validation in background through `experimental.run_task`. The REST `cancel_task_on_validation_run` endpoint and `can_cancel` projection give the user a termination control instead of forcing them to wait on a blocking call.
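The "return a handle immediately, work in the background" pattern described above can be illustrated with plain `asyncio`. This sketch does not use the MCP SDK or `experimental.run_task`; `TASKS`, `start_validation`, `_validate_body`, and `demo` are hypothetical names, and the in-memory dict is an assumption standing in for durable storage.

```python
import asyncio
import uuid

# Hypothetical in-memory registry: task_id -> lifecycle record.
TASKS: dict = {}


async def _validate_body(task_id: str, code: str) -> None:
    """Stand-in for the slow validation step."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    TASKS[task_id].update(status="completed", result={"chars": len(code)})


async def start_validation(code: str) -> dict:
    """Mint an id, schedule the work, and return without waiting for it."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"status": "working", "result": None}
    # Keep a reference to the asyncio task so it is not garbage collected.
    TASKS[task_id]["handle"] = asyncio.create_task(_validate_body(task_id, code))
    return {"task_id": task_id, "status": "working"}


async def demo() -> tuple:
    created = await start_validation("print('hi')")
    record = TASKS[created["task_id"]]
    status_at_return = record["status"]   # caller already holds a handle here
    await record["handle"]                # later: the background work finishes
    return status_at_return, record["status"]
```

The caller gets a `task_id` before any validation has run, which is what makes later polling, recovery, or cancellation possible.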

P2

Aligned

Ensure that background work remains perceptible

The previous ephemeral lifecycle is replaced with durable, perceptible primitives: `UserValidationRun.task_id` is a top-level recovery handle, `task_state` stores lifecycle JSON, `project_task_state()` projects it into dashboard-facing status, and `me.validation_history(run_id)` is described as the recovery path. The UI also renders `task_status` across history, projects, dashboard recent activity, and detail surfaces.

P3

Aligned

Align feedback with the user’s level of attention

Feedback is proportionate: the primary UI consumes the compact `TaskStateProjection` with `task_status`, `status_message`, `terminal_reason`, and `can_cancel`, while routine background work is shown through badges/pills rather than full protocol internals. Terminal and exceptional states such as `failed`, `cancelled`, `task_state_parse_error`, and no-result cards escalate only when user attention is needed.

P4

Aligned

Apply progressive disclosure to system agency

The code applies progressive disclosure by separating the raw task protocol state from the dashboard surface. `project_task_state()` strips `schema_version`, `idempotency_key`, `progress_token`, and `ttl_ms`; normal views show user-level lifecycle state, while deeper recovery/inspection happens through `me.validation_history(run_id)` and task result retrieval.
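A projection of this kind might look like the sketch below. The four stripped field names come from the finding above; everything else is an assumption, including the function name `project_task_state_sketch`, the idea that the raw state is a JSON string, and the shape of the fallback returned when parsing fails.

```python
import json

# Protocol-only fields that the finding says are stripped from the projection.
_INTERNAL_FIELDS = {"schema_version", "idempotency_key", "progress_token", "ttl_ms"}


def project_task_state_sketch(raw_task_state) -> dict:
    """Project persisted lifecycle JSON into a dashboard-facing dict.

    Hypothetical stand-in for project_task_state(); the real signature
    and return type are not shown in the review.
    """
    try:
        state = json.loads(raw_task_state)
        if not isinstance(state, dict):
            raise ValueError("task_state must be a JSON object")
    except (ValueError, TypeError):
        # Assumed fallback: surface a parse error instead of raising into the UI.
        return {"task_status": "task_state_parse_error", "can_cancel": False}
    # Drop internal protocol fields; keep everything user-facing.
    return {k: v for k, v in state.items() if k not in _INTERNAL_FIELDS}
```

Doing the stripping in one server-side function means clients never need to parse raw `task_state` themselves.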

P5

Aligned

Replace implied magic with clear mental models

The workflow replaces implied magic with named states and documented limits: `_PROJECTION_STATUS_MAP` maps protocol states like `working` and `input_required` into `active` and `awaiting_action`; cancelled/failed-without-assessment runs get explicit no-result handling; and the asymmetry that only `architect.validate` is task-augmented while `validate_consensus` and `certify` remain sync-only is disclosed on rendered surfaces.
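As a rough illustration of the named-state mapping, the sketch below reproduces the two entries the finding names (`working` and `input_required`). The remaining entries, the `"unknown"` fallback, and the helper names `project_status` and `needs_no_result_card` are assumptions made for this example.

```python
# Only the first two entries are named in the finding; the rest are assumed.
_PROJECTION_STATUS_MAP_SKETCH = {
    "working": "active",
    "input_required": "awaiting_action",
    "completed": "completed",   # assumed pass-through
    "failed": "failed",         # assumed pass-through
    "cancelled": "cancelled",   # assumed pass-through
}


def project_status(protocol_status: str) -> str:
    """Map a protocol-level status to a user-facing one (hypothetical helper)."""
    return _PROJECTION_STATUS_MAP_SKETCH.get(protocol_status, "unknown")


def needs_no_result_card(protocol_status: str, has_assessment: bool) -> bool:
    """True for cancelled or failed runs that never produced an assessment."""
    return protocol_status in {"cancelled", "failed"} and not has_assessment
```

An explicit table like this gives every state the user can see a name, and makes an unmapped protocol state visible rather than silently rendered.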

P6

Aligned

Expose meaningful operational state, not internal complexity

The user-facing state is meaningful rather than internal. `TaskStateProjection` exposes `task_status`, `status_message`, timestamps, `terminal_reason`, and `can_cancel`, while intentionally hiding protocol fields such as `idempotency_key`, `progress_token`, and `ttl_ms`. Frontend reads consume the projection only, avoiding client-side parsing of raw `task_state`.

P8

Aligned

Make hand-offs, approvals, and blockers explicit

Hand-offs and blockers are explicit. Idempotency dedupe returns an existing `CreateTaskResult` and annotates `_meta.dedupe_hit` plus `_meta.dedupe_link` rather than silently swallowing duplicate work. The task worker raises on structured error envelopes so the SDK can mark the task failed instead of completed, and cancellation returns clear 400/404/409 cases in `cancel_task_on_validation_run`.
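The dedupe signalling described above could be sketched like this. The `_meta.dedupe_hit` and `_meta.dedupe_link` keys come from the finding; the class name `InMemoryTaskRegistry`, the in-memory dict, and the format of the link value are hypothetical.

```python
import uuid


class InMemoryTaskRegistry:
    """Hypothetical stand-in for a durable store with idempotency keys."""

    def __init__(self) -> None:
        self._by_key: dict = {}

    def create_task(self, idempotency_key: str) -> dict:
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            # Duplicate request: return the existing task and say so explicitly.
            return {
                "task_id": existing,
                "_meta": {
                    "dedupe_hit": True,
                    "dedupe_link": f"/runs/{existing}",  # assumed link format
                },
            }
        task_id = str(uuid.uuid4())
        self._by_key[idempotency_key] = task_id
        return {"task_id": task_id, "_meta": {"dedupe_hit": False}}
```

The point of the annotation is that the caller can tell "your work was started" apart from "this work already existed", instead of the duplicate being swallowed.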

P9

Aligned

Represent delegated work as a system, not merely as a conversation

Delegated validation work is represented as a system: `PgValidationTaskStore` backs tasks with database rows, `task_id` is unique-indexed, `task_state` persists lifecycle, tasks are queryable/listable/cancellable through task methods, and dashboard surfaces render lifecycle status independently of conversation text. This is a durable task model rather than an unstructured transcript.

P10

Aligned

Optimise for steering, not only initiating

The code supports mid-run steering through cancellation. `cancel_task_on_validation_run()` exposes a user-facing cancel endpoint; `PgValidationTaskStore.update_task(..., status='cancelled')` writes terminal state and sets `row.cancel_requested = True`; terminal-to-nonterminal transitions are rejected with `is_terminal()`, and the UI uses `can_cancel` to show the control only while actionable. The documentation also honestly frames that in-flight provider calls may not be abortable, avoiding a false mental model of magical cancellation.
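The transition guard and cancel flag described above might be sketched as follows. `is_terminal`, `cancel_requested`, and `can_cancel` are names from the finding; the set of terminal statuses, the `RunRow` dataclass, the `InvalidTransition` exception, and the free-function form of `update_task` are assumptions made for this example.

```python
from dataclasses import dataclass

# Assumed set of terminal statuses; the review does not enumerate them.
TERMINAL_STATUSES = frozenset({"completed", "failed", "cancelled"})


class InvalidTransition(Exception):
    """Raised when a terminal task is moved back to a non-terminal state."""


@dataclass
class RunRow:
    """Hypothetical stand-in for a persisted validation run row."""
    status: str = "working"
    cancel_requested: bool = False


def is_terminal(status: str) -> bool:
    return status in TERMINAL_STATUSES


def update_task(row: RunRow, status: str) -> RunRow:
    # Reject terminal-to-nonterminal moves. Terminal-to-terminal moves are
    # not covered by the review; this sketch lets them through.
    if is_terminal(row.status) and not is_terminal(status):
        raise InvalidTransition(f"{row.status} -> {status}")
    row.status = status
    if status == "cancelled":
        row.cancel_requested = True
    return row


def can_cancel(row: RunRow) -> bool:
    """The cancel control is only offered while the run is still actionable."""
    return not is_terminal(row.status)
```

Setting `cancel_requested` records the user's intent even if an in-flight provider call cannot be aborted, which matches the caveat in the finding.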

Add to your README

Two embeddable variants: a small badge and a richer card.

Score card (recommended)

Blueprint Readiness Score card
[![Blueprint Readiness Score card](https://aidesignblueprint.com/api/badge/run/2876f354-e141-41f3-9582-b413567b0f77/card.svg)](https://aidesignblueprint.com/en/readiness-review/2876f354-e141-41f3-9582-b413567b0f77)

Flat badge

Blueprint Readiness Score badge
[![Blueprint Readiness Score](https://aidesignblueprint.com/api/badge/run/2876f354-e141-41f3-9582-b413567b0f77.svg)](https://aidesignblueprint.com/en/readiness-review/2876f354-e141-41f3-9582-b413567b0f77)
Baseline and iteration details
Rubric: 2026-05-04

Run ID: 2876f354-e141-41f3-9582-b413567b0f77 · Results expire after 90 days

Run by agents. Governed by humans. Validated by the AI Design Blueprint.