Level 4: Agent Harness — Full Runtime Access
Give the agent a full runtime via the Claude Agent SDK. It can search files, read docs, and reason through problems autonomously, while the destructive action (issuing a refund) stays behind a separate approval gate.
Key Facts
- Level: advanced
- Runtime: Python • Claude Agent SDK + Pydantic + Python Dotenv
- Pattern: Runtime-backed investigation with explicit review artifacts
- Interaction: Live sandbox • Script
- Updated: 14 March 2026
Model context
Model-agnostic • Local-viable • Wrapped tool calling acceptable • Medium reasoning requirement • Orchestration compensates
The harness pattern adds structural guardrails that compensate for lower model quality. Local models are viable when the harness validates outputs before acting.
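A minimal sketch of "validate before acting", using only the standard library as a stand-in for the Pydantic validation the script performs. The field names mirror the script's `RefundRecommendation`; `parse_recommendation`, `validate_then_act`, and the rule that at least one evidence path must be cited are illustrative assumptions, not part of the script. The point is that a weaker model's output is rejected before anything acts on it.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Callable

# Mirrors the RefundDecision values used in the script.
ALLOWED_DECISIONS = {"approve_refund", "deny_refund", "needs_more_info"}


@dataclass(frozen=True)
class Recommendation:
    decision: str
    amount_usd: float | None
    evidence_files: list[str]


def parse_recommendation(raw: str) -> Recommendation:
    """Raise ValueError unless the raw model output has the expected shape."""
    data = json.loads(raw)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    decision = data.get("decision")
    # Check the type first: an unhashable value would break the set lookup.
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    amount = data.get("amount_usd")
    if amount is not None:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount < 0:
            raise ValueError(f"invalid amount_usd: {amount!r}")
        amount = float(amount)
    evidence = data.get("evidence_files")
    # Assumption for this sketch: at least one cited knowledge/ path is required.
    if not isinstance(evidence, list) or not evidence or not all(isinstance(p, str) for p in evidence):
        raise ValueError("evidence_files must be a non-empty list of paths")
    return Recommendation(decision, amount, evidence)


def validate_then_act(raw: str, act: Callable[[Recommendation], None]) -> str:
    """Call `act` only when the model output passes validation."""
    try:
        rec = parse_recommendation(raw)
    except ValueError as exc:
        return f"blocked: {exc}"
    act(rec)
    return f"acted: {rec.decision}"
```

Any model that can emit JSON fits behind this gate; a malformed or out-of-vocabulary answer produces a `blocked:` result instead of a side effect.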
4-agent-harness.py (Python)
"""
Level 4: Agent Harness — Full Runtime Access (governed pattern)

Give the agent a full runtime via the Claude Agent SDK for INVESTIGATION,
but keep destructive execution behind a separate approval gate.

NOTE: Run with `python 4-agent-harness.py` (not IPython/Jupyter).
The Claude Agent SDK uses anyio TaskGroups, which are incompatible with
nest_asyncio.
https://platform.claude.com/docs/en/agent-sdk/python

Design notes (iter-2, 2026-05-22):
The autonomous agent has READ-ONLY billing access — it can verify a
transaction but cannot issue a refund. The agent emits a typed
`RefundRecommendation`; a separate `execute_refund_with_approval()`
function runs the destructive action behind a caller-supplied
`approval_callback`. This makes the automation boundary machine-enforced
(not prompt-enforced) per Blueprint principles P5 (clear mental models)
and P8 (explicit hand-offs / approvals / blockers).

Audit trail (P7) captures every phase transition plus the approval reason.
Operational phases (P6) are user-meaningful, not raw SDK message types.
"""
import asyncio
import json
import uuid
from datetime import datetime, timezone
from enum import Enum
from pathlib import Path
from typing import Awaitable, Callable

from pydantic import BaseModel
from claude_agent_sdk import (
    AssistantMessage,
    ClaudeAgentOptions,
    ClaudeSDKClient,
    ResultMessage,
    TextBlock,
    ToolUseBlock,
    tool,
    create_sdk_mcp_server,
)
from dotenv import load_dotenv

load_dotenv()

KNOWLEDGE_DIR = Path(__file__).parent / "knowledge"


# ── Tools exposed to the autonomous agent ────────────────────────────
# Only READ-ONLY billing access. `issue_refund` is INTENTIONALLY
# NOT registered as a tool — destructive execution is gated by
# `execute_refund_with_approval()` below.
@tool(
    "check_payment_gateway",
    "Check payment processor for transaction status and refund eligibility",
    {"transaction_date": str, "amount": str},
)
async def check_payment_gateway(args):
    # MOCK: returns a fixed response; a real integration would query the
    # payment processor using the arguments supplied by the agent.
    return {
        "content": [
            {
                "type": "text",
                "text": (
                    f"Payment Gateway Response for {args['transaction_date']} — ${args['amount']}:\n"
                    "- Transaction ID: txn_8f3k2j1\n"
                    "- Status: SETTLED\n"
                    "- Refund eligible: YES\n"
                    "- Original payment method: Visa ending in 4242\n"
                    "- Settlement date: 2025-02-02"
                ),
            }
        ]
    }


# ── Typed schemas ────────────────────────────────────────────────────
class CustomerEmail(BaseModel):
    subject: str
    body: str


class RefundDecision(str, Enum):
    """What the AUTONOMOUS agent recommends. Execution is a separate step."""

    APPROVE = "approve_refund"
    DENY = "deny_refund"
    NEEDS_MORE_INFO = "needs_more_info"


class ApprovalDecision(str, Enum):
    """What the human/policy approval gate returns. Captured in audit."""

    APPROVED = "approved"
    REJECTED = "rejected"
    NEEDS_REVIEW = "needs_review"


class HarnessPhase(str, Enum):
    """User-meaningful operational states (P6)."""

    INVESTIGATING = "investigating"
    VERIFYING_TRANSACTION = "verifying_transaction"
    DRAFTING_RECOMMENDATION = "drafting_recommendation"
    AWAITING_APPROVAL = "awaiting_approval"
    APPROVAL_DECIDED = "approval_decided"
    REFUND_EXECUTED = "refund_executed"
    REFUND_BLOCKED = "refund_blocked"
    COMPLETED = "completed"
    FAILED = "failed"


class AuditEvent(BaseModel):
    """Append-only audit record (P7)."""

    timestamp: datetime
    phase: HarnessPhase
    detail: str


class RefundRecommendation(BaseModel):
    """What the autonomous agent outputs. Bound to evidence."""

    decision: RefundDecision
    amount_usd: float | None
    customer_id: str
    rationale: str
    policy_rule_applied: str  # the rule NAME, not the model's claim
    evidence_files: list[str]  # knowledge/ paths the model consulted
    payment_gateway_result: str  # verbatim output from check_payment_gateway
    customer_email: CustomerEmail


class ApprovalResult(BaseModel):
    """What the approval callback returns. `reason` is captured in audit."""

    decision: ApprovalDecision
    reason: str


class HarnessOutput(BaseModel):
    """Final harness output. `recommendation` is always set on success;
    `refund_execution_id` is ONLY set if the approval gate returned
    APPROVED and execution succeeded."""

    run_id: str
    recommendation: RefundRecommendation | None
    audit_trail: list[AuditEvent]
    final_state: HarnessPhase
    refund_execution_id: str | None = None


# ── Investigation phase: autonomous agent, READ-ONLY billing ────────
async def run_harness(task: str) -> HarnessOutput:
    """Run the autonomous investigation agent. Emits a recommendation;
    NEVER calls a refund. Caller is responsible for passing the result
    through `execute_refund_with_approval()` if execution is intended.
    """
    server = create_sdk_mcp_server(
        name="billing-api",
        version="1.0.0",
        tools=[check_payment_gateway],  # ← refund execution removed (P8)
    )
    options = ClaudeAgentOptions(
        system_prompt=(
            "You are a senior support analyst with access to:\n\n"
            f"1. A knowledge base at: {KNOWLEDGE_DIR}\n"
            " - policies/ — refund policy, escalation matrix, subscription management\n"
            " - customers/ — customer profiles with transaction history\n"
            " - templates/ — response templates\n\n"
            "2. External billing API (READ-ONLY for you):\n"
            " - check_payment_gateway — verify transaction status\n\n"
            "You CANNOT execute refunds. A separate approval step (outside your "
            "control) decides whether to execute. Your job is to investigate, "
            "then emit a RefundRecommendation. Cite every knowledge/ file you "
            "consulted by path and quote the policy rule by name. If the "
            "evidence does not support a clean approve/deny, return "
            "NEEDS_MORE_INFO with the missing facts spelled out."
        ),
        allowed_tools=[
            "Read",
            "Glob",
            "Grep",
            "mcp__billing-api__check_payment_gateway",
            # `mcp__billing-api__issue_refund` INTENTIONALLY OMITTED (P8)
        ],
        mcp_servers={"billing-api": server},
        output_format={
            "type": "json_schema",
            "schema": RefundRecommendation.model_json_schema(),
        },
        permission_mode="acceptEdits",
        max_turns=15,
        max_budget_usd=1.00,
        model="sonnet",
        cwd=str(KNOWLEDGE_DIR),
    )

    run_id = str(uuid.uuid4())
    audit: list[AuditEvent] = []

    def _log(phase: HarnessPhase, detail: str) -> None:
        audit.append(
            AuditEvent(
                timestamp=datetime.now(timezone.utc),
                phase=phase,
                detail=detail,
            )
        )

    _log(HarnessPhase.INVESTIGATING, f"task: {task[:120]}")
    recommendation: RefundRecommendation | None = None

    async with ClaudeSDKClient(options=options) as client:
        await client.query(task)
        async for message in client.receive_response():
            if isinstance(message, AssistantMessage):
                for block in message.content:
                    if isinstance(block, ToolUseBlock):
                        # Map raw tool calls onto user-meaningful phases (P6).
                        phase = (
                            HarnessPhase.VERIFYING_TRANSACTION
                            if "check_payment_gateway" in block.name
                            else HarnessPhase.INVESTIGATING
                        )
                        _log(phase, f"tool: {block.name}({block.input})")
                    elif isinstance(block, TextBlock):
                        # Foreground stream — concise, no console flood.
                        print(block.text)
            elif isinstance(message, ResultMessage):
                # Compare against None so a genuine zero cost is still reported.
                cost = (
                    f"${message.total_cost_usd:.4f}"
                    if message.total_cost_usd is not None
                    else "n/a"
                )
                _log(
                    HarnessPhase.DRAFTING_RECOMMENDATION,
                    f"agent done · turns={message.num_turns} · cost={cost}",
                )
                if message.structured_output:
                    try:
                        # Decode inside the try so malformed JSON is logged
                        # as FAILED instead of crashing the harness.
                        raw = (
                            json.loads(message.structured_output)
                            if isinstance(message.structured_output, str)
                            else message.structured_output
                        )
                        recommendation = RefundRecommendation.model_validate(raw)
                    except Exception as exc:
                        _log(
                            HarnessPhase.FAILED,
                            f"recommendation parse failed: {exc}",
                        )

    if recommendation is None:
        _log(HarnessPhase.REFUND_BLOCKED, "agent emitted no valid structured recommendation")
        return HarnessOutput(
            run_id=run_id,
            recommendation=None,
            audit_trail=audit,
            final_state=HarnessPhase.REFUND_BLOCKED,
        )

    _log(
        HarnessPhase.AWAITING_APPROVAL,
        f"decision={recommendation.decision.value} · amount={recommendation.amount_usd}",
    )
    return HarnessOutput(
        run_id=run_id,
        recommendation=recommendation,
        audit_trail=audit,
        final_state=HarnessPhase.AWAITING_APPROVAL,
    )


# ── Execution phase: SEPARATE, gated by approval callback ───────────
async def execute_refund_with_approval(
    output: HarnessOutput,
    approval_callback: Callable[[RefundRecommendation], Awaitable[ApprovalResult]],
) -> HarnessOutput:
    """The ONLY path to actual refund execution. Gated by caller-supplied
    `approval_callback` — could be a human-in-loop CLI prompt, a policy
    service decision, or a teams-channel approval bot. The boundary is
    machine-enforced: even if the agent recommended APPROVE, the refund
    is not executed unless the approval callback returns APPROVED.
    """
    audit = output.audit_trail

    def _log(phase: HarnessPhase, detail: str) -> None:
        audit.append(
            AuditEvent(
                timestamp=datetime.now(timezone.utc),
                phase=phase,
                detail=detail,
            )
        )

    if output.recommendation is None:
        _log(HarnessPhase.REFUND_BLOCKED, "no recommendation to act on")
        output.final_state = HarnessPhase.REFUND_BLOCKED
        return output

    if output.recommendation.decision is not RefundDecision.APPROVE:
        _log(
            HarnessPhase.REFUND_BLOCKED,
            f"agent did not recommend approval: {output.recommendation.decision.value}",
        )
        output.final_state = HarnessPhase.REFUND_BLOCKED
        return output

    approval = await approval_callback(output.recommendation)
    _log(
        HarnessPhase.APPROVAL_DECIDED,
        f"approval={approval.decision.value} · reason={approval.reason}",
    )
    if approval.decision is not ApprovalDecision.APPROVED:
        output.final_state = HarnessPhase.REFUND_BLOCKED
        return output

    # MOCK: in production, replace with real payment gateway call.
    # This boundary is where execution becomes irreversible — keep it
    # isolated, audited, and behind the approval gate above. Wrap real
    # gateway calls in retries/idempotency keys; persist
    # `refund_execution_id` durably before returning to the caller.
    output.refund_execution_id = "ref_9x2m4p7"
    _log(
        HarnessPhase.REFUND_EXECUTED,
        f"refund_execution_id={output.refund_execution_id}",
    )
    output.final_state = HarnessPhase.REFUND_EXECUTED
    return output


# ── Example wiring: CLI approval gate ───────────────────────────────
async def _cli_approval_callback(
    recommendation: RefundRecommendation,
) -> ApprovalResult:
    """Example approval gate — prints the recommendation and asks for
    explicit confirmation. In production, replace with the appropriate
    surface (policy service, teams-channel bot, human-in-loop UI)."""
    print("\n══ AWAITING APPROVAL ══")
    print(f"Decision recommended: {recommendation.decision.value}")
    print(f"Amount: ${recommendation.amount_usd}")
    print(f"Policy rule: {recommendation.policy_rule_applied}")
    print(f"Rationale: {recommendation.rationale}")
    print(f"Evidence files: {recommendation.evidence_files}")
    raw = input("Approve? (approved / rejected / needs_review): ").strip().lower()
    reason = input("Reason (captured in audit): ").strip() or "no reason given"
    try:
        decision = ApprovalDecision(raw)
    except ValueError:
        decision = ApprovalDecision.NEEDS_REVIEW
        reason = f"invalid input '{raw}' coerced to needs_review · {reason}"
    return ApprovalResult(decision=decision, reason=reason)


async def main() -> None:
    output = await run_harness(
        "Customer cust_12345 reports a duplicate charge on their February bill. "
        "Investigate using the knowledge base, determine the right action per policy, "
        "and draft a personalized response using the appropriate template."
    )
    output = await execute_refund_with_approval(output, _cli_approval_callback)
    print("\n══ FINAL OUTPUT ══")
    print(output.model_dump_json(indent=2))


if __name__ == "__main__":
    asyncio.run(main())
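The comment at the script's execution boundary recommends idempotency keys and an approval gate in front of the irreversible call. A minimal sketch of both ideas together, under stated assumptions: `InMemoryGateway` is a hypothetical in-memory stand-in for a payment processor, `policy_approval` is a hypothetical policy-service callback with an invented $50 auto-approval limit, and the caller supplies the idempotency key (for example, derived from the harness `run_id`). As in `execute_refund_with_approval()`, any decision other than `"approved"` stops before the gateway is touched.

```python
from __future__ import annotations

import asyncio
import uuid
from dataclasses import dataclass, field
from typing import Awaitable, Callable

AUTO_APPROVE_LIMIT_USD = 50.0  # hypothetical policy threshold


@dataclass(frozen=True)
class RefundRequest:
    customer_id: str
    amount_usd: float
    idempotency_key: str  # assumption: derived by the caller from run_id


@dataclass
class InMemoryGateway:
    """Hypothetical stand-in for a payment gateway; not a real API."""

    executed: dict[str, str] = field(default_factory=dict)

    async def refund(self, req: RefundRequest) -> str:
        # A replayed key returns the original id instead of refunding twice.
        if req.idempotency_key not in self.executed:
            self.executed[req.idempotency_key] = f"ref_{uuid.uuid4().hex[:8]}"
        return self.executed[req.idempotency_key]


# Returns (decision, reason); decision is "approved", "rejected" or "needs_review".
ApprovalCallback = Callable[[RefundRequest], Awaitable[tuple[str, str]]]


async def policy_approval(req: RefundRequest) -> tuple[str, str]:
    """Hypothetical policy service: auto-approve small refunds only."""
    if req.amount_usd <= AUTO_APPROVE_LIMIT_USD:
        return "approved", f"within ${AUTO_APPROVE_LIMIT_USD:.0f} auto-approval limit"
    return "needs_review", "above auto-approval limit; route to a human"


async def execute_with_gate(
    req: RefundRequest,
    approve: ApprovalCallback,
    gateway: InMemoryGateway,
    audit: list[str],
) -> str | None:
    """Reach the gateway only if the approval callback returns "approved"."""
    decision, reason = await approve(req)
    audit.append(f"approval={decision} reason={reason}")
    if decision != "approved":
        return None
    refund_id = await gateway.refund(req)
    audit.append(f"refund_execution_id={refund_id}")
    return refund_id


async def demo() -> None:
    gateway, audit = InMemoryGateway(), []
    # Illustrative values only.
    req = RefundRequest("cust_12345", 49.99, idempotency_key="run-1")
    first = await execute_with_gate(req, policy_approval, gateway, audit)
    retry = await execute_with_gate(req, policy_approval, gateway, audit)  # simulated retry
    print(first == retry, len(gateway.executed))  # prints: True 1
    print("\n".join(audit))


if __name__ == "__main__":
    asyncio.run(demo())
```

Because the gateway keys executions by idempotency key, a retry after a timeout returns the original refund id rather than issuing a second refund, and every decision (including a blocked one) still lands in the audit list.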
Related principles
- P1 (delegation): Design for delegation rather than direct manipulation. Design experiences around the assignment of work, the expression of intent, the setting of constraints, and the review of results, rather than requiring users to execute each step manually.
- P5 (delegation): Replace implied magic with clear mental models. The product should help users understand what the system can do, what it is currently doing, what it cannot do, and what conditions govern its behaviour.
- P9 (orchestration): Represent delegated work as a system, not merely as a conversation. Where work involves multiple steps, agents, dependencies, or concurrent activities, it should be represented as a structured system rather than solely as a message stream.
- P10 (delegation): Optimise for steering, not only initiating. The system should support users not only in starting tasks, but also in guiding, refining, reprioritising, and correcting work while it is underway.
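P9 asks for delegated work to be represented as a structured system. The script records `HarnessPhase` values in its audit trail but does not enforce their order. One way to make that order explicit is a transition table that rejects illegal jumps. The table below is a hypothetical sketch that loosely follows the order in which `run_harness()` and `execute_refund_with_approval()` move between phases; it is not defined by the script, and `COMPLETED` is left out because the script never uses it.

```python
from enum import Enum


class Phase(str, Enum):
    # Mirrors the HarnessPhase values used in the script.
    INVESTIGATING = "investigating"
    VERIFYING_TRANSACTION = "verifying_transaction"
    DRAFTING_RECOMMENDATION = "drafting_recommendation"
    AWAITING_APPROVAL = "awaiting_approval"
    APPROVAL_DECIDED = "approval_decided"
    REFUND_EXECUTED = "refund_executed"
    REFUND_BLOCKED = "refund_blocked"
    FAILED = "failed"


# Hypothetical transition table; terminal phases allow no further moves.
ALLOWED: dict[Phase, set[Phase]] = {
    Phase.INVESTIGATING: {Phase.INVESTIGATING, Phase.VERIFYING_TRANSACTION, Phase.DRAFTING_RECOMMENDATION},
    Phase.VERIFYING_TRANSACTION: {Phase.INVESTIGATING, Phase.VERIFYING_TRANSACTION, Phase.DRAFTING_RECOMMENDATION},
    Phase.DRAFTING_RECOMMENDATION: {Phase.AWAITING_APPROVAL, Phase.FAILED, Phase.REFUND_BLOCKED},
    Phase.FAILED: {Phase.REFUND_BLOCKED},
    Phase.AWAITING_APPROVAL: {Phase.APPROVAL_DECIDED, Phase.REFUND_BLOCKED},
    Phase.APPROVAL_DECIDED: {Phase.REFUND_EXECUTED, Phase.REFUND_BLOCKED},
    Phase.REFUND_EXECUTED: set(),
    Phase.REFUND_BLOCKED: set(),
}


class PhaseTracker:
    """Rejects out-of-order phase changes instead of silently recording them."""

    def __init__(self) -> None:
        self.history: list[Phase] = [Phase.INVESTIGATING]

    @property
    def current(self) -> Phase:
        return self.history[-1]

    def advance(self, nxt: Phase) -> None:
        if nxt not in ALLOWED[self.current]:
            raise ValueError(f"illegal transition {self.current.value} -> {nxt.value}")
        self.history.append(nxt)
```

With a tracker like this, a code path that tried to execute a refund straight from `AWAITING_APPROVAL` would fail loudly instead of producing a misleading audit trail.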