Diagnostic
Research · Citations · Inspectable

Run by agents. Governed by humans.

The demo-to-production gap in agentic AI.

Demos are optimised for single-path capability. Production needs state, resumability, approvals, monitoring, and recovery. The strongest public practitioner record names the same gap.

Key Facts

AI pilots without P&L return: 95%
Agentic systems scaled to production: 23%
Projects forecast cancelled by 2027: >40%
Failure modes mapped to principles: 10

The diagnosis

The pattern is consistent across the public record. Demos succeed by optimising one path: a polished prompt, a single tool call, a curated input. Production fails because the path is no longer single. Real users send malformed inputs, the network drops, the model retries, the queue backs up, the approval gate fires, the operator pauses the run, the session times out, the database write fails. Each of those is its own runtime concern, and each one is what the demo was not designed to handle. The work that closes the gap is not a better prompt. It is state, resumability, approvals, monitoring, and recovery: the architectural primitives operators have been building into systems for decades, applied to a new substrate. The principles below name what those primitives look like for agentic AI.
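As a concrete illustration, here is a minimal sketch of those five primitives in one loop, using only the Python standard library. Every name in it (the checkpoint directory, the plan, the tool names) is a hypothetical stand-in, not any framework's API.

```python
# A minimal sketch of the five primitives in one loop, using only the
# standard library. Every name here (the checkpoint directory, the plan,
# the tool names) is a hypothetical illustration, not any framework's API.
import json
import uuid
from pathlib import Path

CHECKPOINTS = Path("checkpoints")   # stand-in for a durable store (DB, queue)
CHECKPOINTS.mkdir(exist_ok=True)

def save_state(run_id: str, state: dict) -> None:
    # State: every step persists before the next one starts.
    (CHECKPOINTS / f"{run_id}.json").write_text(json.dumps(state))

def load_state(run_id: str) -> dict | None:
    # Resumability: a crashed or paused run restarts from its last checkpoint.
    path = CHECKPOINTS / f"{run_id}.json"
    return json.loads(path.read_text()) if path.exists() else None

def approved(action: dict) -> bool:
    # Approvals: consequential actions block until a human decides.
    return input(f"approve {action['name']}? [y/N] ").strip().lower() == "y"

def run(run_id: str | None = None) -> None:
    run_id = run_id or uuid.uuid4().hex
    state = load_state(run_id) or {"step": 0, "log": []}
    plan = [  # a fixed two-step plan stands in for the model's tool calls
        {"name": "draft_email", "consequential": False},
        {"name": "send_email", "consequential": True},
    ]
    for step in range(state["step"], len(plan)):
        action = plan[step]
        if action["consequential"] and not approved(action):
            state["log"].append(f"step {step}: paused at approval gate")
            save_state(run_id, state)   # recovery: resume later from this point
            return
        state["log"].append(f"step {step}: ran {action['name']}")  # monitoring
        state["step"] = step + 1
        save_state(run_id, state)
    print(f"run {run_id} complete:", *state["log"], sep="\n  ")

if __name__ == "__main__":
    run()
```

The shape, not the code, is the point: because every step checkpoints before the next begins, a crash, a pause, or a rejected approval leaves a run that can be resumed and audited rather than restarted from scratch.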

If your agent ships with any of these, the doctrine has a name for it

Four recurring failure patterns seen in agentic systems in production. Each one links to the principle that addresses it.

The numbers that name the gap

Every stat below links to its publisher. None are AI-summarised. Items the source research couldn't fully verify are flagged as such, not filled with vendor folklore.

Metric | Number | What it means | Source
Organizations experimenting with AI agents | 62% | High curiosity, low maturity | McKinsey Global Survey · 2025
Organizations scaling an agentic AI system | 23% | The gap quantified: the prototype-to-deployment drop | McKinsey Global Survey · 2025
Organizations abandoning most AI initiatives before production | 42% | Pilot-to-production drop-off is worsening; up from 17% | S&P Global Market Intelligence · 2025-10-15
Average POC scrap rate before production | 46% | Mean attrition across the S&P sample | S&P Global Market Intelligence · 2025-10-15
Gartner forecast for agentic AI cancellation by end of 2027 | >40% | Tied to escalating costs, unclear value, inadequate risk controls | Gartner press release · 2025-06-25
Share naming quality as the top production blocker (LangChain survey) | 32% | Quality outranks cost as the load-bearing blocker | State of Agent Engineering 2026 · 2026
MIT NANDA 'zero measurable return' figure | 95% | Sharpest public measure of value shortfall on enterprise GenAI | MIT NANDA · State of AI in Business · 2025

Three commonly requested numbers have no high-confidence public figure: average time-to-abandon for stalled agent projects, cost overrun rate versus initial estimate, and a universal eval-to-production regression rate. The validator beta is open to teams whose case studies would help close these gaps.

What practitioners say on the record

Every quote below is verbatim from the practitioner's own writing or talk. Each links to its primary source. None are paraphrased; none come from vendor marketing.

"We focused on control and durability."
Nuno Campos · Author / Engineer, LangGraph
Building LangGraph: Designing an Agent Runtime from first principles · 4 Sept 2025

"Most 'agents' today are essentially stateless workflows."
Charles Packer · Co-founder, Letta
Stateful Agents: The Missing Link in LLM Intelligence · 6 Feb 2025

"They do the context engineering."
Andrej Karpathy · Researcher / Engineer
2025 LLM Year in Review · 19 Dec 2025

"What to build and how to evaluate it is still hard."
Andrew Ng · Founder, DeepLearning.AI
Andrew Ng on Building with AI: Speed, Smarts, and Scale · 28 Jan 2026

"Agents are inherently dangerous."
Simon Willison · Independent researcher / developer
Designing agentic loops · 30 Sept 2025

"We've spent 60-80% of development time on error analysis and evaluation."
Hamel Husain & Shreya Shankar · AI evaluation practitioners
LLM Evals: Everything You Need to Know · 15 Jan 2026

What the runtime ecosystem says it solves

This landscape reports each project's public claim alongside what the public record verifies that claim covers. Vendor inclusion implies neither endorsement nor competitive comparison. Items flagged in the source research as unverified are excluded; this list will expand as additional vendors are independently verified.

LangChain / LangGraph / LangSmith
Source: Building LangGraph (2025-09) + State of Agent Engineering 2026
Public claim: Production deployment for long-running, stateful agents; observability; evaluation
What it solves: Strongest verified runtime story: task queues, checkpointing, tracing, HITL interrupts, deployment surfaces
Gap that remains: Governance is largely app-owned. Business policy and irreversible-action approvals are not framework-solved.

OpenAI Agents SDK
Source: OpenAI Cookbook (Building Reliable Agents, 2026-05)
Public claim: Code-first agents with tools, handoffs, approvals, tracing, sandboxes
What it solves: Building-block SDK when the application owns orchestration, tool execution, approvals, and state. Memory-compaction patterns documented.
Gap that remains: Burden remains on the shipping team to define approvals, persistence, and state policy correctly.

Anthropic
Source: Measuring AI agent autonomy + Trustworthy agents in practice (2026)
Public claim: Trustworthy agents, MCP ecosystem, post-deployment monitoring guidance
What it solves: Real-world autonomy measurement, post-deployment monitoring, human-in-the-loop safety stance
Gap that remains: Anthropic's own materials warn that prompt injection and unattended autonomy remain open. Approval and least-privilege design remain external engineering tasks.

Letta
Source: Stateful Agents: The Missing Link in LLM Intelligence (2025-02)
Public claim: Stateful agents with persistent memory and learning over time
What it solves: Strong on persistent memory, state architecture, and context management
Gap that remains: A memory-first architecture helps with continuity, but is not by itself a full governed runtime with enterprise approval and policy control.

Five wrong moves the practitioner record names

These are the patterns named by the deep-research source as recurring misconceptions about closing the demo-to-production gap. Each card carries a counter-quote from the practitioner who named the limit publicly.

If the prompt gets better, the system is production-ready.

The operator literature consistently says the real work is orchestration, memory, tools, traces, approvals, and state, not prompt craft.

Counter-evidence: "Instead, we focused on control and durability." (Nuno Campos)

Once you add enough eval dashboards, the problem is done.

Public practitioners say evals remain immature, easy to game, and incomplete without production traces.

Counter-evidence: "Be wary of optimizing for high eval pass rates." (Hamel Husain & Shreya Shankar)
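One concrete way to act on that warning is to stop treating the eval set as frozen: keep folding sampled production traces back into the regression suite, so the pass rate is measured against what users actually send. A minimal sketch, with a hypothetical file name and sampling rate:

```python
# A minimal sketch: fold sampled production traces into the regression
# suite so the eval set cannot silently drift away from real traffic.
# File name and sampling rate are hypothetical.
import json
import random
from pathlib import Path

EVAL_SET = Path("eval_set.jsonl")
SAMPLE_RATE = 0.05  # capture roughly 5% of production traces

def maybe_capture(trace: dict) -> None:
    # Called on every production run; captured traces are queued for labelling.
    if random.random() < SAMPLE_RATE:
        with EVAL_SET.open("a") as f:
            f.write(json.dumps({"input": trace["input"],
                                "output": trace["output"],
                                "needs_label": True}) + "\n")

# Example: an input class the pre-launch eval set never covered.
maybe_capture({"input": "refund order #123, it arrived broken",
               "output": "I have issued a refund.",
               "tools_called": ["issue_refund"]})
```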

A visible 'approve' button is enough governance.

Decorative human-in-the-loop without blocking semantics, tool visibility, and permission control does not meaningfully constrain the system.

Counter-evidence: "I suggest treating those SHOULDs as if they were MUSTs." (Simon Willison)
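What blocking semantics look like is easy to sketch: the run halts at the gate, and the operator can inspect the full proposed call, edit its arguments, or reject it with a reason that lands in the trace. All names below are hypothetical, not any framework's API.

```python
# A minimal sketch of a blocking approval gate, in contrast to a decorative
# button: inspect the full proposed call, edit its arguments, or reject it
# with a recorded reason. All names are hypothetical.
import json

def approval_gate(tool: str, args: dict) -> tuple[str, dict, str]:
    # Blocks the run until the operator decides; returns (decision, args, note).
    print(json.dumps({"proposed_tool": tool, "args": args}, indent=2))
    choice = input("[a]pprove / [e]dit / [r]eject: ").strip().lower()
    if choice == "a":
        return "approved", args, ""
    if choice == "e":
        return "approved", json.loads(input("new args as JSON: ")), "edited"
    return "rejected", args, input("rejection reason (recorded in trace): ")

decision, final_args, note = approval_gate("send_payment",
                                           {"amount": 4200, "to": "acct-7"})
if decision == "rejected":
    print("blocked:", note)   # the reason feeds back into the run's trace
```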

Frontier capability will erase the production gap.

Stronger models still need harnesses, data flow, runtime controls, and memory handling.

Counter-evidence: "The next major advancement in AI won't come from larger models." (Charles Packer)

The path from pilot to production is closed by tuning the weights.

Field evidence says most teams are not fine-tuning, and the harder problems are evaluation, policy, state, and integration.

Counter-evidence: "Master prompting first, then fine-tuning, and only then look at RL." (Andrew Ng)

The doctrine answers each failure mode

Ten failure modes the source research surfaced as recurring in production agentic systems. For each, the AI Design Blueprint principles that address it.

  1. Conversation as the only coordination model

    Multi-step work is driven through chat alone, with no underlying orchestration to recover, retry, or hand off cleanly.

    Principles: P4

  2. Silent background failure

    Long-running work fails without surfacing progress or error to the operator who started the run.

    Principles: P2, P3

  3. Missing approval boundary

    Agents take consequential actions (writes, sends, payments) without an explicit human gate at the boundary.

    Principles: P7, P8

  4. State loss on refresh

    A page reload, tab close, or session timeout erases the agent's working memory mid-task.

    Principles: P6

  5. Eval-to-production drift

    Pre-launch evaluations look clean; production traffic exposes a class of inputs the eval set never covered.

    Principles: P9

  6. Tool-use cascade hallucination

    An agent invents tool arguments or response shapes; the next step trusts the fabrication and acts on it.

    Principles: P4, P7

  7. Context window collapse

    Long conversations exceed the model's window; older context drops silently and behaviour degrades without warning.

    Principles: P6

  8. Approval gate decoration

    An approval prompt exists, but the operator can only click 'approve': no inspect, no edit, no rejection path with a reason.

    Principles: P7, P8

  9. Demo-only happy path

    The system works on the scripted demo flow; the second branch is never built or even named.

    Principles: P9, P10

  10. Loop detection failure

    An agent retries the same failing action indefinitely, burning tokens and time without escalating to a human.

    Principles: P3, P10
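Failure mode 10 is the most mechanical of the ten, so it is the easiest to sketch. Below is a minimal, hypothetical counter-measure in Python: fingerprint each tool call, bound the retries, and escalate to a human once the same call keeps failing. The function names and stub callables are illustrative assumptions, not any particular runtime's API.

```python
# A minimal sketch of a counter-measure to failure mode 10: fingerprint each
# tool call, bound the retries, and escalate to a human instead of looping.
# Function names and the stub callables are hypothetical.
import hashlib

MAX_ATTEMPTS = 3

def fingerprint(tool: str, args: dict) -> str:
    # Identical call + identical args -> identical fingerprint.
    return hashlib.sha256(f"{tool}:{sorted(args.items())}".encode()).hexdigest()

def call_with_escalation(tool: str, args: dict, run_tool, escalate):
    attempts: dict[str, int] = {}
    while True:
        key = fingerprint(tool, args)
        attempts[key] = attempts.get(key, 0) + 1
        if attempts[key] > MAX_ATTEMPTS:
            # Loop detected: the same call keeps failing; hand off, don't retry.
            return escalate(tool, args, attempts[key] - 1)
        ok, result = run_tool(tool, args)
        if ok:
            return result

# Usage with stubs standing in for the real runtime:
print(call_with_escalation(
    "fetch_report", {"id": 7},
    run_tool=lambda t, a: (False, None),              # a tool that always fails
    escalate=lambda t, a, n: f"escalated {t} after {n} failed attempts",
))
```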

The validator runs the diagnosis above against your architecture. It produces a readiness review with a public run-id, a per-principle alignment row, and an iteration trajectory the next reviewer can audit.

If you are shipping into any of the failure modes named on this page, the closed beta is open. Your readiness review will be public; your case study would close one of the open research items above. The discipline the doctrine asks of others, the platform applies to itself first: every score on this site links to a real validator run, including the one for the platform's own architecture.