Run by agents. Governed by humans.
Demos are optimised for single-path capability. Production needs state, resumability, approvals, monitoring, and recovery. The strongest public practitioner record names the same gap.
The diagnosis
The pattern is consistent across the public record. Demos succeed by optimising one path: a polished prompt, a single tool call, a curated input. Production fails because the path is no longer single. Real users send malformed inputs, the network drops, the model retries, the queue backs up, the approval gate fires, the operator pauses the run, the session times out, the database write fails. Each of those is its own runtime concern, and each is precisely what the demo was never designed to handle. The work that closes the gap is not a better prompt. It is state, resumability, approvals, monitoring, and recovery: the architectural primitives operators have been building into systems for decades, applied to a new substrate. The principles below name what those primitives look like for agentic AI.
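The resumability primitive named here can be made concrete in a few lines: persist state after every completed step so a crashed or paused run resumes where it stopped instead of restarting from scratch. A minimal sketch, not any vendor's API; the `run_with_checkpoints` helper and its JSON state file are illustrative assumptions:

```python
import json
from pathlib import Path

def run_with_checkpoints(steps, state_path=Path("run_state.json")):
    """Execute named steps in order, persisting state after each one.

    If the process dies mid-run, calling this again resumes at the
    first incomplete step rather than re-executing finished work.
    """
    state = {"completed": [], "outputs": {}}
    if state_path.exists():
        state = json.loads(state_path.read_text())  # resume a prior run
    for name, fn in steps:
        if name in state["completed"]:
            continue  # this step is already durable; skip on resume
        state["outputs"][name] = fn(state["outputs"])
        state["completed"].append(name)
        state_path.write_text(json.dumps(state))  # durable after every step
    return state["outputs"]
```

The design choice is that durability is per step, not per run: the write after each iteration is what turns a page reload or process crash from "erased working memory" into "resume at step three".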
Every stat below links to its publisher. None are AI-summarised. Items the source research couldn't fully verify are flagged as such, not filled with vendor folklore.
| Metric | Number | What it means | Source |
|---|---|---|---|
| Organizations experimenting with AI agents | 62% | High curiosity, low maturity | McKinsey Global Survey · 2025 |
| Organizations scaling an agentic AI system | 23% | The gap quantified, the prototype-to-deployment drop | McKinsey Global Survey · 2025 |
| Organizations abandoning most AI initiatives before production | 42% | Pilot-to-production drop-off worsening, rose from 17% | S&P Global Market Intelligence · 2025-10-15 |
| Average POC scrap rate before production | 46% | Mean attrition across the S&P sample | S&P Global Market Intelligence · 2025-10-15 |
| Gartner forecast for agentic AI cancellation by end 2027 | >40% | Tied to escalating costs, unclear value, inadequate risk controls | Gartner press release · 2025-06-25 |
| Teams naming quality the top production blocker | 32% | Quality outranks cost as the load-bearing blocker | State of Agent Engineering 2026 · 2026 |
| MIT NANDA 'zero measurable return' figure | 95% | Sharpest public measure of value shortfall on enterprise GenAI | MIT NANDA · State of AI in Business · 2025 |
Three commonly asked-for numbers lack a high-confidence public figure: average time-to-abandon for stalled agent projects, cost overrun rate versus initial estimate, and a universal eval-to-production regression rate. The validator beta is open to teams whose case studies would help close these gaps.
Every quote below is verbatim from the practitioner's own writing or talk. Each links to its primary source. None are paraphrased; none come from vendor marketing.
We focused on control and durability.
Most 'agents' today are essentially stateless workflows.
They do the context engineering.
What to build and how to evaluate it is still hard.
Agents are inherently dangerous.
We've spent 60-80% of development time on error analysis and evaluation.
This landscape reports each project's public claim alongside what the public record verifies that claim covers. Vendor inclusion implies neither endorsement nor competitive comparison. Items flagged in the source research as unverified are excluded; this list will expand as additional vendors are independently verified.
| Vendor | Public claim | What it solves | Gap that remains |
|---|---|---|---|
| LangChain / LangGraph / LangSmith Building LangGraph (2025-09) + State of Agent Engineering 2026 | Production deployment for long-running, stateful agents; observability; evaluation | Strongest verified runtime story: task queues, checkpointing, tracing, HITL interrupts, deployment surfaces | Governance is largely app-owned. Business policy and irreversible-action approvals are not framework-solved. |
| OpenAI Agents SDK OpenAI Cookbook (Building Reliable Agents, 2026-05) | Code-first agents with tools, handoffs, approvals, tracing, sandboxes | Building-block SDK when the application owns orchestration, tool execution, approvals, and state. Memory-compaction patterns documented. | Burden remains on the shipping team to define approvals, persistence, and state policy correctly. |
| Anthropic Measuring AI agent autonomy + Trustworthy agents in practice (2026) | Trustworthy agents, MCP ecosystem, post-deployment monitoring guidance | Real-world autonomy measurement, post-deployment monitoring, human-in-the-loop safety stance | Anthropic's own materials warn that prompt injection and unattended autonomy remain open. Approval and least-privilege design remain external engineering tasks. |
| Letta Stateful Agents: The Missing Link in LLM Intelligence (2025-02) | Stateful agents with persistent memory and learning over time | Strong on persistent memory, state architecture, and context management | A memory-first architecture helps with continuity, but is not by itself a full governed runtime with enterprise approval and policy control. |
These are the patterns named by the deep-research source as recurring misconceptions about closing the demo-to-production gap. Each card carries a counter-quote from the practitioner who named the limit publicly.
If the prompt gets better, the system is production-ready.
The operator literature consistently says the real work is orchestration, memory, tools, traces, approvals, and state, not prompt craft.
Once you add enough eval dashboards, the problem is done.
Public practitioners say evals remain immature, easy to game, and incomplete without production traces.
A visible 'approve' button is enough governance.
Decorative human-in-the-loop without blocking semantics, tool visibility, and permission control does not meaningfully constrain the system.
Frontier capability will erase the production gap.
Stronger models still need harnesses, data flow, runtime controls, and memory handling.
The path from pilot to production is closed by tuning the weights.
Field evidence says most teams are not fine-tuning, and the harder problems are evaluation, policy, state, and integration.
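The "approve button" myth above has a concrete shape in code. Blocking semantics mean the tool call cannot proceed until a human verdict arrives, and the verdict space is wider than approve: inspect the arguments, edit them, or reject with a reason. A hedged sketch under those assumptions; `gated_execute`, `Verdict`, and `Decision` are illustrative names, not a specific framework's API:

```python
from dataclasses import dataclass, field
from enum import Enum

class Decision(Enum):
    APPROVE = "approve"
    EDIT = "edit"      # human corrects the arguments before execution
    REJECT = "reject"  # tool never runs; reason is recorded

@dataclass
class Verdict:
    decision: Decision
    payload: dict = field(default_factory=dict)  # possibly edited arguments
    reason: str = ""                             # required on rejection

def gated_execute(tool, args, review):
    """Block until a human verdict arrives; the tool never runs on REJECT.

    `review` is the blocking call into your approval surface. It must be
    shown the full tool name and arguments and must return a Verdict.
    A gate without these semantics is decoration, not governance.
    """
    verdict = review(tool.__name__, args)
    if verdict.decision is Decision.REJECT:
        return {"status": "rejected", "reason": verdict.reason}
    # on EDIT, the human's corrected arguments replace the agent's
    final_args = verdict.payload if verdict.decision is Decision.EDIT else args
    return {"status": "executed", "result": tool(**final_args)}
```

The load-bearing property is that `gated_execute` is the only path to the tool: an agent cannot route around the gate, and a rejection leaves an auditable reason behind.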
Ten failure modes the source research surfaced as recurring on production agentic systems. For each, the AI Design Blueprint principles that address it.
1. Multi-step work is driven through chat alone, with no underlying orchestration to recover, retry, or hand off cleanly.
2. Long-running work fails without surfacing progress or error to the operator who started the run.
3. Agents take consequential actions (writes, sends, payments) without an explicit human gate at the boundary.
4. A page reload, tab close, or session timeout erases the agent's working memory mid-task.
5. Pre-launch evaluations look clean; production traffic exposes a class of inputs the eval set never covered.
6. An agent invents tool arguments or response shapes; the next step trusts the fabrication and acts on it.
7. Long conversations exceed the model's window; older context drops silently and behaviour degrades without warning.
8. An approval prompt exists, but the operator can only click 'approve': no inspect, no edit, no rejection path with a reason.
9. The system works on the scripted demo flow; the second branch is never built or even named.
10. An agent retries the same failing action indefinitely, burning tokens and time without escalating to a human.
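The last failure mode above, unbounded retries, has a standard counter: cap attempts, back off between them, then escalate to a human with full context instead of looping. A minimal sketch under those assumptions; `attempt` and `NeedsHuman` are illustrative names, and the backoff constants are placeholders to tune:

```python
import time

class NeedsHuman(Exception):
    """Raised when automated recovery is exhausted; carries context for the operator."""
    def __init__(self, action, attempts, last_error):
        super().__init__(f"{action} failed {attempts}x; last error: {last_error}")
        self.action = action
        self.attempts = attempts
        self.last_error = last_error

def attempt(action_name, fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry with exponential backoff, then escalate instead of looping forever."""
    last = None
    for i in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last = exc
            sleep(base_delay * (2 ** i))  # backoff between attempts
    # the escalation is the point: a human gets the action name,
    # the attempt count, and the last error, not a silent token burn
    raise NeedsHuman(action_name, max_attempts, last)
```

The `sleep` parameter is injected so the policy is testable without real waiting; in production the default `time.sleep` applies.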
The validator runs the diagnosis above against your architecture. It produces a readiness review with a public run-id, a per-principle alignment row, and an iteration trajectory the next reviewer can audit.
If you are shipping into any of the failure modes named on this page, the closed beta is open. Your readiness review will be public, and your case study would close one of the open research items above. The discipline the doctrine asks of others, the platform applies to itself first: every score on this site links to a real validator run, including the one for the platform's own architecture.