Level 4: Agent Harness — Full Runtime Access

Give the agent a full runtime via the Claude Agent SDK. It can search files, read docs, and reason through problems autonomously.

Key Facts

Level: advanced
Runtime: Python • Pydantic + Python Dotenv
Pattern: Runtime-backed investigation with explicit review artifacts
Interaction: Live sandbox • Script
Updated: 14 March 2026

High-level flow

How this example moves from input to execution and reviewable output

Level 4: Agent Harness —… -> Run the agent task -> Investigation scope -> Runtime activity -> Structured handoff -> Knowledge access and external…

Why this page exists

This example is shown as both real source code and a product-facing interaction pattern so learners can connect implementation, UX, and doctrine without leaving the library.

Visual flow • Real source • Sandbox or walkthrough • MCP access

How should this example be used in the platform?

Use the sandbox to understand the experience pattern first, then inspect the source to see how the product boundary, model boundary, and doctrine boundary are actually implemented.

UX pattern: Runtime-backed investigation with explicit review artifacts
Knowledge access and external tools in one runtime
Explicit permission and budget settings
Structured output keeps a wide-capability agent reviewable

Source references

Library entry: agents-agent-complexity-4-agent-harness
Source path: content/example-library/sources/agents/agent-complexity/4-agent-harness.py
Libraries: pydantic, python-dotenv, claude-agent-sdk
Runtime requirements: Local repo environment
Related principles: Design for delegation rather than direct manipulation; Replace implied magic with clear mental models; Represent delegated work as a system, not merely as a conversation; Optimise for steering, not only initiating

Model context

Model-agnostic • Local-viable • Wrapped tool calling acceptable • Medium reasoning requirement • Orchestration compensates

The harness pattern adds structural guardrails that compensate for lower model quality. Local models are viable when the harness validates outputs before acting.
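As a minimal sketch of "validate outputs before acting", the snippet below checks a raw model reply against a small allowlist before anything downstream is allowed to use it. The names `ALLOWED_DECISIONS`, `ValidatedDecision`, and `validate_model_output` are hypothetical and use only the standard library. The source file itself relies on Pydantic for this step.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Decision values mirror the RefundDecision enum in the source file.
ALLOWED_DECISIONS = {"approve_refund", "deny_refund", "needs_more_info"}


@dataclass(frozen=True)
class ValidatedDecision:
    """Hypothetical, stdlib-only stand-in for a validated recommendation."""
    decision: str
    rationale: str


def validate_model_output(raw: str) -> Optional[ValidatedDecision]:
    """Return a validated decision, or None if the output must not be acted on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all: reject
    if not isinstance(data, dict):
        return None
    decision = data.get("decision")
    rationale = data.get("rationale")
    # Reject unknown decisions, including ones a weaker model might invent.
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        return None
    if not isinstance(rationale, str) or not rationale.strip():
        return None
    return ValidatedDecision(decision=decision, rationale=rationale)
```

A caller would treat `None` as "do nothing and surface the failure", which is what lets a lower-quality model sit behind the same harness.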

4-agent-harness.py

python
"""
Level 4: Agent Harness — Full Runtime Access (governed pattern)
Give the agent a full runtime via the Claude Agent SDK for INVESTIGATION,
keep destructive execution behind a separate approval gate.

NOTE: Run with `python 4-agent-harness.py` (not IPython/Jupyter).
The Claude Agent SDK uses anyio TaskGroups incompatible with nest_asyncio.

https://platform.claude.com/docs/en/agent-sdk/python

Design notes (iter-2, 2026-05-22):
The autonomous agent has READ-ONLY billing access — it can verify a
transaction but cannot issue a refund. The agent emits a typed
`RefundRecommendation`; a separate `execute_refund_with_approval()`
function runs the destructive action behind a caller-supplied
`approval_callback`. This makes the automation boundary machine-enforced
(not prompt-enforced) per Blueprint principles P5 (clear mental models)
and P8 (explicit hand-offs / approvals / blockers).

Audit trail (P7) captures every phase transition + the approval reason.
Operational phases (P6) are user-meaningful, not raw SDK message types.
"""

import asyncio
import json
import uuid
from datetime import datetime, timezone
from enum import Enum
from pathlib import Path
from typing import Awaitable, Callable

from pydantic import BaseModel
from claude_agent_sdk import (
    AssistantMessage,
    ClaudeAgentOptions,
    ClaudeSDKClient,
    ResultMessage,
    TextBlock,
    ToolUseBlock,
    tool,
    create_sdk_mcp_server,
)
from dotenv import load_dotenv

load_dotenv()

KNOWLEDGE_DIR = Path(__file__).parent / "knowledge"


# ── Tools exposed to the autonomous agent ────────────────────────────
# Only READ-ONLY billing access. `issue_refund` is INTENTIONALLY
# NOT registered as a tool — destructive execution is gated by
# `execute_refund_with_approval()` below.


@tool(
    "check_payment_gateway",
    "Check payment processor for transaction status and refund eligibility",
    {"transaction_date": str, "amount": str},
)
async def check_payment_gateway(args):
    return {
        "content": [
            {
                "type": "text",
                "text": (
                    f"Payment Gateway Response for {args['transaction_date']} — ${args['amount']}:\n"
                    "- Transaction ID: txn_8f3k2j1\n"
                    "- Status: SETTLED\n"
                    "- Refund eligible: YES\n"
                    "- Original payment method: Visa ending in 4242\n"
                    "- Settlement date: 2025-02-02"
                ),
            }
        ]
    }


# ── Typed schemas ────────────────────────────────────────────────────


class CustomerEmail(BaseModel):
    subject: str
    body: str


class RefundDecision(str, Enum):
    """What the AUTONOMOUS agent recommends. Execution is a separate step."""
    APPROVE = "approve_refund"
    DENY = "deny_refund"
    NEEDS_MORE_INFO = "needs_more_info"


class ApprovalDecision(str, Enum):
    """What the human/policy approval gate returns. Captured in audit."""
    APPROVED = "approved"
    REJECTED = "rejected"
    NEEDS_REVIEW = "needs_review"


class HarnessPhase(str, Enum):
    """User-meaningful operational states (P6)."""
    INVESTIGATING = "investigating"
    VERIFYING_TRANSACTION = "verifying_transaction"
    DRAFTING_RECOMMENDATION = "drafting_recommendation"
    AWAITING_APPROVAL = "awaiting_approval"
    APPROVAL_DECIDED = "approval_decided"
    REFUND_EXECUTED = "refund_executed"
    REFUND_BLOCKED = "refund_blocked"
    COMPLETED = "completed"
    FAILED = "failed"


class AuditEvent(BaseModel):
    """Append-only audit record (P7)."""
    timestamp: datetime
    phase: HarnessPhase
    detail: str


class RefundRecommendation(BaseModel):
    """What the autonomous agent outputs. Bound to evidence."""
    decision: RefundDecision
    amount_usd: float | None
    customer_id: str
    rationale: str
    policy_rule_applied: str  # the rule NAME, not the model's claim
    evidence_files: list[str]  # knowledge/ paths the model consulted
    payment_gateway_result: str  # verbatim output from check_payment_gateway
    customer_email: CustomerEmail


class ApprovalResult(BaseModel):
    """What the approval callback returns. `reason` is captured in audit."""
    decision: ApprovalDecision
    reason: str


class HarnessOutput(BaseModel):
    """Final harness output. `recommendation` is always set on success;
    `refund_execution_id` is ONLY set if the approval gate returned
    APPROVED and execution succeeded."""
    run_id: str
    recommendation: RefundRecommendation | None
    audit_trail: list[AuditEvent]
    final_state: HarnessPhase
    refund_execution_id: str | None = None


# ── Investigation phase: autonomous agent, READ-ONLY billing ────────


async def run_harness(task: str) -> HarnessOutput:
    """Run the autonomous investigation agent. Emits a recommendation;
    NEVER calls a refund. Caller is responsible for passing the result
    through `execute_refund_with_approval()` if execution is intended.
    """
    server = create_sdk_mcp_server(
        name="billing-api",
        version="1.0.0",
        tools=[check_payment_gateway],  # ← refund execution removed (P8)
    )

    options = ClaudeAgentOptions(
        system_prompt=(
            "You are a senior support analyst with access to:\n\n"
            f"1. A knowledge base at: {KNOWLEDGE_DIR}\n"
            "   - policies/ — refund policy, escalation matrix, subscription management\n"
            "   - customers/ — customer profiles with transaction history\n"
            "   - templates/ — response templates\n\n"
            "2. External billing API (READ-ONLY for you):\n"
            "   - check_payment_gateway — verify transaction status\n\n"
            "You CANNOT execute refunds. A separate approval step (outside your "
            "control) decides whether to execute. Your job is to investigate, "
            "then emit a RefundRecommendation. Cite every knowledge/ file you "
            "consulted by path and quote the policy rule by name. If the "
            "evidence does not support a clean approve/deny, return "
            "NEEDS_MORE_INFO with the missing facts spelled out."
        ),
        allowed_tools=[
            "Read",
            "Glob",
            "Grep",
            "mcp__billing-api__check_payment_gateway",
            # `mcp__billing-api__issue_refund` INTENTIONALLY OMITTED (P8)
        ],
        mcp_servers={"billing-api": server},
        output_format={
            "type": "json_schema",
            "schema": RefundRecommendation.model_json_schema(),
        },
        # Investigation is read-only; "acceptEdits" would auto-approve file edits.
        permission_mode="default",
        max_turns=15,
        max_budget_usd=1.00,
        model="sonnet",
        cwd=str(KNOWLEDGE_DIR),
    )

    run_id = str(uuid.uuid4())
    audit: list[AuditEvent] = []

    def _log(phase: HarnessPhase, detail: str) -> None:
        audit.append(
            AuditEvent(
                timestamp=datetime.now(timezone.utc),
                phase=phase,
                detail=detail,
            )
        )

    _log(HarnessPhase.INVESTIGATING, f"task: {task[:120]}")

    recommendation: RefundRecommendation | None = None

    async with ClaudeSDKClient(options=options) as client:
        await client.query(task)
        async for message in client.receive_response():
            if isinstance(message, AssistantMessage):
                for block in message.content:
                    if isinstance(block, ToolUseBlock):
                        phase = (
                            HarnessPhase.VERIFYING_TRANSACTION
                            if "check_payment_gateway" in block.name
                            else HarnessPhase.INVESTIGATING
                        )
                        _log(phase, f"tool: {block.name}({block.input})")
                    elif isinstance(block, TextBlock):
                        # Foreground stream — concise, no console flood.
                        print(block.text)
            elif isinstance(message, ResultMessage):
                cost = (
                    f"${message.total_cost_usd:.4f}"
                    if message.total_cost_usd is not None
                    else "n/a"
                )
                _log(
                    HarnessPhase.DRAFTING_RECOMMENDATION,
                    f"agent done · turns={message.num_turns} · cost={cost}",
                )
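                if message.structured_output:
                    try:
                        # Parse inside the try so malformed JSON is audited
                        # instead of escaping the receive loop.
                        raw = (
                            json.loads(message.structured_output)
                            if isinstance(message.structured_output, str)
                            else message.structured_output
                        )
                        recommendation = RefundRecommendation.model_validate(raw)
                    except ValueError as exc:
                        # json.JSONDecodeError and pydantic's ValidationError
                        # both subclass ValueError.
                        _log(
                            HarnessPhase.FAILED,
                            f"recommendation parse failed: {exc}",
                        )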

    if recommendation is None:
        _log(HarnessPhase.REFUND_BLOCKED, "agent emitted no structured recommendation")
        return HarnessOutput(
            run_id=run_id,
            recommendation=None,
            audit_trail=audit,
            final_state=HarnessPhase.REFUND_BLOCKED,
        )

    _log(
        HarnessPhase.AWAITING_APPROVAL,
        f"decision={recommendation.decision.value} · amount={recommendation.amount_usd}",
    )
    return HarnessOutput(
        run_id=run_id,
        recommendation=recommendation,
        audit_trail=audit,
        final_state=HarnessPhase.AWAITING_APPROVAL,
    )


# ── Execution phase: SEPARATE, gated by approval callback ───────────


async def execute_refund_with_approval(
    output: HarnessOutput,
    approval_callback: Callable[[RefundRecommendation], Awaitable[ApprovalResult]],
) -> HarnessOutput:
    """The ONLY path to actual refund execution. Gated by caller-supplied
    `approval_callback` — could be a human-in-loop CLI prompt, a policy
    service decision, or a teams-channel approval bot. The boundary is
    machine-enforced: even if the agent recommended APPROVE, the refund
    is not executed unless the approval callback returns APPROVED.
    """
    audit = output.audit_trail

    def _log(phase: HarnessPhase, detail: str) -> None:
        audit.append(
            AuditEvent(
                timestamp=datetime.now(timezone.utc),
                phase=phase,
                detail=detail,
            )
        )

    if output.recommendation is None:
        _log(HarnessPhase.REFUND_BLOCKED, "no recommendation to act on")
        output.final_state = HarnessPhase.REFUND_BLOCKED
        return output

    if output.recommendation.decision is not RefundDecision.APPROVE:
        _log(
            HarnessPhase.REFUND_BLOCKED,
            f"agent did not recommend approval: {output.recommendation.decision.value}",
        )
        output.final_state = HarnessPhase.REFUND_BLOCKED
        return output

    approval = await approval_callback(output.recommendation)
    _log(
        HarnessPhase.APPROVAL_DECIDED,
        f"approval={approval.decision.value} · reason={approval.reason}",
    )

    if approval.decision is not ApprovalDecision.APPROVED:
        output.final_state = HarnessPhase.REFUND_BLOCKED
        return output

    # MOCK: in production, replace with real payment gateway call.
    # This boundary is where execution becomes irreversible — keep it
    # isolated, audited, and behind the approval gate above. Wrap real
    # gateway calls in retries/idempotency keys; persist
    # `refund_execution_id` durably before returning to the caller.
    output.refund_execution_id = "ref_9x2m4p7"
    _log(
        HarnessPhase.REFUND_EXECUTED,
        f"refund_execution_id={output.refund_execution_id}",
    )
    output.final_state = HarnessPhase.REFUND_EXECUTED
    return output


# ── Example wiring: CLI approval gate ───────────────────────────────


async def _cli_approval_callback(
    recommendation: RefundRecommendation,
) -> ApprovalResult:
    """Example approval gate — prints the recommendation and asks for
    explicit confirmation. In production, replace with the appropriate
    surface (policy service, teams-channel bot, human-in-loop UI)."""
    print("\n══ AWAITING APPROVAL ══")
    print(f"Decision recommended: {recommendation.decision.value}")
    print(f"Amount: ${recommendation.amount_usd}")
    print(f"Policy rule: {recommendation.policy_rule_applied}")
    print(f"Rationale: {recommendation.rationale}")
    print(f"Evidence files: {recommendation.evidence_files}")
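    # input() blocks, so run it in a worker thread to keep the event loop free.
    raw = (
        await asyncio.to_thread(input, "Approve? (approved / rejected / needs_review): ")
    ).strip().lower()
    reason = (
        await asyncio.to_thread(input, "Reason (captured in audit): ")
    ).strip() or "no reason given"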
    try:
        decision = ApprovalDecision(raw)
    except ValueError:
        decision = ApprovalDecision.NEEDS_REVIEW
        reason = f"invalid input '{raw}' coerced to needs_review · {reason}"
    return ApprovalResult(decision=decision, reason=reason)


async def main() -> None:
    output = await run_harness(
        "Customer cust_12345 reports a duplicate charge on their February bill. "
        "Investigate using the knowledge base, determine the right action per policy, "
        "and draft a personalized response using the appropriate template."
    )
    output = await execute_refund_with_approval(output, _cli_approval_callback)

    print("\n══ FINAL OUTPUT ══")
    print(output.model_dump_json(indent=2))


if __name__ == "__main__":
    asyncio.run(main())
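The approval gate in the source accepts any async callable, so the CLI prompt could be swapped for a policy check. The sketch below is one hypothetical way to do that. `Recommendation` and `Approval` are simplified stand-ins for `RefundRecommendation` and `ApprovalResult` so the snippet runs on its own, and `AUTO_APPROVE_LIMIT_USD` is an assumed threshold, not something the source defines.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recommendation:
    """Simplified stand-in for RefundRecommendation (assumption)."""
    decision: str
    amount_usd: Optional[float]


@dataclass
class Approval:
    """Simplified stand-in for ApprovalResult (assumption)."""
    decision: str  # "approved" | "rejected" | "needs_review"
    reason: str


AUTO_APPROVE_LIMIT_USD = 50.0  # hypothetical policy threshold


async def policy_approval_callback(rec: Recommendation) -> Approval:
    """Approve small refunds automatically; route everything else to review."""
    if rec.decision != "approve_refund":
        return Approval("rejected", "agent did not recommend a refund")
    if rec.amount_usd is None:
        return Approval("needs_review", "recommendation carries no amount")
    if rec.amount_usd <= 0:
        return Approval("rejected", "non-positive refund amount")
    if rec.amount_usd > AUTO_APPROVE_LIMIT_USD:
        return Approval(
            "needs_review",
            f"amount {rec.amount_usd:.2f} exceeds auto-approve limit",
        )
    return Approval("approved", "within auto-approve limit")


if __name__ == "__main__":
    result = asyncio.run(policy_approval_callback(Recommendation("approve_refund", 29.99)))
    print(result)
```

Every branch returns a reason string, matching the source's intent that the approval reason is captured in the audit trail.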

What should the learner inspect in the code?

Look for the exact place where system scope is bounded: schema definitions, prompt framing, runtime configuration, and the call site that turns user intent into a concrete model or workflow action.

create_sdk_mcp_server(
allowed_tools=[
output_format={
async with ClaudeSDKClient

How does the sandbox relate to the source?

The sandbox should make the UX legible: what the user sees, what the system is deciding, and how the result becomes reviewable. The source then shows how that behavior is actually implemented.

Sandbox • Runtime-backed investigation with explicit review artifacts

Full runtime investigation surface

This simulation shows what changes when an agent can read files, search internal knowledge, and call external tools inside a full runtime harness.

UX explanation

The experience should show that the system is not merely answering from a prompt. It is investigating across multiple resources, so the user needs durable visibility into what was searched, what was verified, and what action was finally taken.
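One hypothetical way to make that visibility durable is to append each audit event to a JSON Lines file as it happens, so the trace survives the process. The helpers `append_audit_event` and `read_audit_log` below are illustrative names. The source keeps its audit trail in memory only.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_audit_event(log_path: Path, run_id: str, phase: str, detail: str) -> dict:
    """Append one audit event as a JSON line and return the written record."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "phase": phase,
        "detail": detail,
    }
    # Append mode keeps earlier events intact (append-only log).
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return event


def read_audit_log(log_path: Path) -> list:
    """Read all events back in the order they were written."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

A UI could then replay the file to show what was searched and verified, even after the run has ended.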

AI design explanation

A harness grants much broader capability than tool-calling alone. That makes inspectability, permission boundaries, and structured output even more important, because the system can now move across a real working environment.
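A permission boundary like this can also be checked mechanically. The sketch below assumes a simple name-based heuristic: `DESTRUCTIVE_MARKERS` and `find_destructive_tools` are hypothetical, while the tool names are copied from the `allowed_tools` list in the source. A check of this kind could run in a unit test so a destructive tool cannot slip into the allowlist unnoticed.

```python
# Tool names as they appear in allowed_tools in the source file.
READ_ONLY_TOOLS = [
    "Read",
    "Glob",
    "Grep",
    "mcp__billing-api__check_payment_gateway",
]

# Hypothetical substrings that suggest a tool can change state.
DESTRUCTIVE_MARKERS = ("refund", "write", "edit", "bash", "delete")


def find_destructive_tools(allowed_tools):
    """Return the tool names that look state-changing under the heuristic."""
    return [
        name
        for name in allowed_tools
        if any(marker in name.lower() for marker in DESTRUCTIVE_MARKERS)
    ]


if __name__ == "__main__":
    flagged = find_destructive_tools(READ_ONLY_TOOLS)
    print("flagged tools:", flagged)
```

Name matching is only a heuristic and not a guarantee. The source enforces the real boundary by never registering the refund tool at all.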

Interaction walkthrough

  1. Launch an investigation task.
  2. Inspect the runtime trace across knowledge files and billing tools.
  3. Review the final structured output and drafted customer email.

Runtime task

Customer cust_12345 reports a duplicate charge on their February bill. Investigate and draft the right response.

Knowledge files • MCP billing tools • Structured output

Runtime trace

The harness trace should expose both knowledge lookup and external tool use.
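A small sketch of how a trace view might separate the two kinds of activity, assuming tool names follow the conventions in the source (built-in `Read`/`Glob`/`Grep` for knowledge lookup, an `mcp__` prefix for external tools). `classify_tool` and `summarize_trace` are hypothetical helper names.

```python
from collections import Counter

# Built-in file tools the source allows for knowledge lookup.
KNOWLEDGE_TOOLS = {"Read", "Glob", "Grep"}


def classify_tool(name: str) -> str:
    """Bucket a tool name into a user-meaningful trace category."""
    if name in KNOWLEDGE_TOOLS:
        return "knowledge_lookup"
    if name.startswith("mcp__"):
        return "external_tool"
    return "other"


def summarize_trace(tool_names) -> dict:
    """Count tool calls per category for a compact trace header."""
    return dict(Counter(classify_tool(name) for name in tool_names))


if __name__ == "__main__":
    calls = ["Glob", "Read", "mcp__billing-api__check_payment_gateway", "Read"]
    print(summarize_trace(calls))
```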

Structured handoff

A wide-capability agent still needs a final reviewable package for the product to present.
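Before presenting the handoff, a product might confirm that the package is complete enough to review. The sketch below assumes the handoff arrives as a plain dict with the field names of `RefundRecommendation` from the source. `REQUIRED_FOR_REVIEW` and `missing_review_fields` are hypothetical.

```python
# Fields a reviewer needs, taken from RefundRecommendation in the source.
REQUIRED_FOR_REVIEW = (
    "decision",
    "rationale",
    "policy_rule_applied",
    "evidence_files",
    "payment_gateway_result",
)


def missing_review_fields(handoff: dict) -> list:
    """Return required fields that are absent or empty in the handoff."""
    return [field for field in REQUIRED_FOR_REVIEW if not handoff.get(field)]


if __name__ == "__main__":
    handoff = {
        "decision": "approve_refund",
        "rationale": "Duplicate charge confirmed by gateway.",
        "policy_rule_applied": "duplicate-charge",  # illustrative rule name
        "evidence_files": [],  # empty: the reviewer has nothing to check
        "payment_gateway_result": "Status: SETTLED",
    }
    print(missing_review_fields(handoff))
```

An empty result means the package can go to the reviewer. Anything else names what the agent still owes.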

Why this needs stronger UX

  • Knowledge access and external tools in one runtime
  • Explicit permission and budget settings
  • Structured output keeps a wide-capability agent reviewable

Used in courses and paths

This example currently stands on its own in the library, but it still connects to the principle system and the broader example family.

Related principles

Runtime architecture

Use this example in your agents

This example is also available through the blueprint’s agent-ready layer. Use the For agents page for the public MCP, deterministic exports, and Claude/Cursor setup.

Define triggers, context, and boundaries before increasing autonomy
Make control, observability, and recovery explicit in the runtime
Choose the right operational patterns before delegating to workflows