Application Guide: Observable evaluation

Proving the system works — not just believing it does

72% of teams building agentic systems believe comprehensive evaluation drives reliability. Only 15% achieve it. The gap is not a knowledge problem — it is a structural discipline problem. Principles 2, 6, and 7 define what observable behaviour requires; this page shows how those principles extend into the engineering measurement layer.

Key Facts

The belief-execution gap
72% believe · 15% achieve · 2.2× reliability advantage for elite teams
Three evaluation tiers
Decision quality · Behaviour quality · Safety and alignment
Grounding principles
Principles 2, 6, and 7
The threshold
11–20 agents: where manual debugging becomes unsustainable

The structural discipline problem

Response fluency is not a proxy for task success. A fluent, well-formed output can mask a failed goal, an unsafe tool invocation, or an intent that was subtly misread. Principle 6 — expose meaningful operational state, not internal complexity — applies not only to what users see, but to what engineering teams can observe and measure about the system's own behaviour.

Why does traditional monitoring fail agents?

Standard monitoring tracks response times and error rates. Agents are non-deterministic: each prompt produces a novel execution path, so yesterday's metrics cannot detect today's failure mode. Principle 2 extends into the engineering layer: if the engineering team cannot observe what the agent is doing, the system cannot remain perceptible to anyone else either.

Monitor tool selection patterns, not just response latency
Detect reasoning quality degradation, not only system errors
Track action completion rate as a first-class metric
Alert on context adherence drift across task types
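The agent-specific signals above can be sketched as a small metrics tracker. This is a minimal illustration, not a production monitoring stack; the class and method names (`AgentMetrics`, `record_tool_call`, `completion_rate`) are hypothetical.

```python
from collections import Counter, deque


class AgentMetrics:
    """Illustrative tracker for agent-level signals: tool selection
    patterns and a rolling action completion rate. Names are hypothetical."""

    def __init__(self, window: int = 100):
        self.tool_calls = Counter()              # tool selection patterns
        self.completions = deque(maxlen=window)  # rolling window of task outcomes

    def record_tool_call(self, tool_name: str) -> None:
        self.tool_calls[tool_name] += 1

    def record_task(self, completed: bool) -> None:
        self.completions.append(completed)

    def completion_rate(self) -> float:
        # Action completion rate as a first-class metric, not an afterthought
        if not self.completions:
            return 0.0
        return sum(self.completions) / len(self.completions)
```

A real deployment would feed these counters into the same alerting pipeline as latency and error rate, so drift in tool selection or completion rate pages the team like any other regression.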
What does a structured trace contain?

Principle 7 (inspectability) requires that the reasoning process is accessible on request. For engineering teams, this means every task run should emit a structured trace: tool calls with parameters, intermediate states, durations, and the decision rationale. Traces are the evidence base for governance, not a debugging afterthought.

Tool call sequence with input parameters and output summaries
Intermediate reasoning states and confidence signals
Duration per step and total task wall time
Final decision rationale in a queryable, replayable format
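One way to make such a trace queryable and replayable is to emit it as structured JSON. The sketch below uses Python dataclasses; the field names (`ToolCall`, `TaskTrace`, `decision_rationale`) mirror the list above but are illustrative assumptions, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    tool: str
    params: dict
    output_summary: str
    duration_s: float


@dataclass
class TaskTrace:
    """One structured trace per task run: the evidence base for governance."""
    task_id: str
    tool_calls: list = field(default_factory=list)   # ToolCall entries, in order
    reasoning_states: list = field(default_factory=list)
    decision_rationale: str = ""
    wall_time_s: float = 0.0

    def to_json(self) -> str:
        # Queryable, replayable format for log storage or a trace viewer
        return json.dumps(asdict(self), indent=2)
```

Because the trace is plain JSON, it can be stored alongside ordinary logs and replayed or diffed long after the run completed.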
How do you measure task-level success?

Define completion criteria before running the agent — not after reading the output. Tier 1 measures decision quality: tool selection accuracy, context adherence, factual correctness. Tier 2 measures behaviour quality: goal achievement rate, efficiency, instruction adherence. Tier 3 measures safety: injection detection, PII exposure, intent preservation. A fluent non-answer is a Tier 2 failure.

Write measurable criteria for each task type before building
Evaluate tool accuracy and goal achievement separately from response quality
Build regression suites around task scenarios, not prompt-response pairs
Track all three tiers as a bundle — not in isolation
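A minimal sketch of criteria-first, three-tier scoring might look like the following. The criteria table and every field name (`pii_exposed`, `correct_tool_calls`, and so on) are assumptions chosen for illustration; the point is that goal achievement is checked against criteria written before the run, separately from response fluency.

```python
# Hypothetical per-task-type completion criteria, written BEFORE the agent runs.
CRITERIA = {
    "refund_request": lambda out: out.get("refund_issued") is True,
    "data_lookup": lambda out: out.get("answer") is not None,
}


def evaluate_run(task_type: str, output: dict, trace: dict) -> dict:
    """Score one run across all three tiers as a bundle; fields are illustrative."""
    # Tier 1: decision quality (tool selection accuracy)
    tool_accuracy = (
        trace["correct_tool_calls"] / trace["tool_calls"]
        if trace["tool_calls"] else 0.0
    )
    # Tier 2: behaviour quality — a fluent non-answer fails this check
    goal_achieved = CRITERIA[task_type](output)
    # Tier 3: safety and alignment (PII exposure flag from the trace)
    safe = not trace.get("pii_exposed", False)
    return {
        "tier1_tool_accuracy": tool_accuracy,
        "tier2_goal_achieved": goal_achieved,
        "tier3_safe": safe,
    }
```

Returning the three tiers in one record makes it harder to celebrate a Tier 1 win while a Tier 3 regression goes unnoticed.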
How do you compare architectural changes?

Model swaps, prompt restructures, retrieval changes, and orchestration modifications produce non-linear effects on task outcomes. Without a stable baseline harness, improvement is guesswork. Maintain a canonical evaluation harness per agent type. Version harness inputs and criteria alongside application code.

Version harness inputs and scoring criteria with application code
Report goal achievement, error rate, latency, and cost as a bundle
Run baseline comparisons on any model swap or prompt restructure
Treat the evaluation harness as a first-class product artefact
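Two of these practices can be sketched concretely: content-hashing the harness inputs so they version with the application code, and reporting the four metrics as one bundle when comparing a baseline against a candidate. The function names and metric keys are assumptions for illustration.

```python
import hashlib
import json

BUNDLE = ("goal_rate", "error_rate", "latency_s", "cost_usd")


def harness_version(scenarios: list, criteria: dict) -> str:
    """Deterministic content hash so harness inputs and scoring criteria
    can be pinned in version control alongside the application code."""
    blob = json.dumps({"scenarios": scenarios, "criteria": criteria}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:12]


def compare(baseline: dict, candidate: dict) -> dict:
    """Report the full metric bundle as deltas, never a single headline number."""
    return {k: candidate[k] - baseline[k] for k in BUNDLE}
```

Running `compare` on every model swap or prompt restructure, against the same hashed harness version, turns "it feels better" into a reviewable diff.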
Read Principle 6 — operational state
Read Principle 7 — inspectability