Application Guide: Recovery Design

Reliable agents do not simply retry: they show where work stopped, what survived, and how recovery proceeds.

Long-running agent recovery breaks down when systems silently retry or vanish mid-task; Blueprint applies P2, P8, P9, and P10 to make checkpoints, blockers, and recovery paths explicit.

Updated April 22, 2026

Key Facts

Best fit: Ops, support, finance, and enterprise teams running multi-step agents over minutes, hours, or days
Primary risk: Silent abandonment and duplicate side effects during replay
Core shift: Retry loop → explicit recovery system
Success signal: Runs end in named states with preserved partial outputs and clear resume actions
Doctrine mapping: P2, P8, P9, P10

Recovery is part of the product, not just the runtime

Long-running agents now operate across memory, tools, external APIs, and delayed approvals, which means failure is rarely a single error and more often an interrupted process with partial side effects. If your system only retries in the background or abandons the run, your team loses the ability to understand progress, protect against duplicate actions, and resume work safely. Recovery design gives each interruption a visible state, a known checkpoint, and a human-readable next step.

Written by the AI Design Blueprint editorial team. Doctrine grounded in the 10 Blueprint Principles.

Why long-running agent recovery matters now

Long-running agents now stretch across minutes, hours, or days, touching tools, memory, and external systems along the way. Recovery design matters because P2 – Ensure that background work remains perceptible and P9 – Represent delegated work as a system, not merely as a conversation require interrupted work to stay visible, resumable, and accountable.

Without explicit recovery, a correct-looking final output can hide duplicate writes, stale context, or lost partial work.
Snapshotting and rehydration make technical restart easier, but your product still needs human-visible failure states and restart boundaries.
The design target is safe continuation, not perfect uptime.
Why the standard retry-first approach fails for long-running agents

Most teams treat recovery as an infrastructure concern: retry the call, restart the worker, or rerun the whole task. That fails in agentic systems because P5 – Replace implied magic with clear mental models and P8 – Make hand-offs, approvals, and blockers explicit require users to know what was attempted, what completed, and what now needs a decision.

Hidden retry loops obscure whether the agent is making progress or compounding the same error.
Full reruns mix harmless read steps with risky side effects, creating duplicate emails, double posts, or conflicting edits.
Silent abandonment leaves orphaned partial outputs with no owner, no next action, and no recovery path.
Conversation-only logs make it hard to inspect the exact checkpoint to resume from, which weakens P7 – Establish trust through inspectability.
How Blueprint replaces retry-first recovery for long-running agents

Blueprint turns recovery from a background mechanism into a first-class run surface. Using P8 – Make hand-offs, approvals, and blockers explicit, P9 – Represent delegated work as a system, not merely as a conversation, and P10 – Optimise for steering, not only initiating, you expose each run as a set of named states, checkpoints, partial completions, and available recovery actions.

Model named states under P6 – Expose meaningful operational state, not internal complexity: running, blocked, awaiting approval, partial complete, recovered, and failed closed.
Checkpoint around meaningful business boundaries rather than every token or tool call.
Separate reversible recovery actions from approval-gated recovery actions.
Preserve completed artifacts and unresolved gaps in a partial completion ledger instead of flattening everything into success or failure.
Let operators steer recovery by retrying from checkpoint, replacing an input, skipping a reversible step, or terminating with a reason.
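The named states and steering actions above can be sketched as a small state model. This is a minimal illustration: the class names `RunState`, `RecoveryAction`, and the `AVAILABLE_ACTIONS` mapping are assumptions for this sketch, not part of Blueprint itself.

```python
from enum import Enum

class RunState(Enum):
    """Named run states surfaced to operators (P6)."""
    RUNNING = "running"
    BLOCKED = "blocked"
    AWAITING_APPROVAL = "awaiting approval"
    PARTIAL_COMPLETE = "partial complete"
    RECOVERED = "recovered"
    FAILED_CLOSED = "failed closed"

class RecoveryAction(Enum):
    """Steering moves an operator can take on an interrupted run (P10)."""
    RETRY_FROM_CHECKPOINT = "retry from checkpoint"
    REPLACE_INPUT = "replace an input"
    SKIP_REVERSIBLE_STEP = "skip a reversible step"
    TERMINATE_WITH_REASON = "terminate with a reason"

# Which steering actions make sense from each state; a design sketch,
# not a complete policy.
AVAILABLE_ACTIONS = {
    RunState.BLOCKED: [RecoveryAction.RETRY_FROM_CHECKPOINT,
                       RecoveryAction.REPLACE_INPUT,
                       RecoveryAction.TERMINATE_WITH_REASON],
    RunState.PARTIAL_COMPLETE: [RecoveryAction.RETRY_FROM_CHECKPOINT,
                                RecoveryAction.SKIP_REVERSIBLE_STEP,
                                RecoveryAction.TERMINATE_WITH_REASON],
}
```

The point of the enum is that every state has a user-visible name; nothing ends in an anonymous crash.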
How to implement long-running agent recovery

Start by defining the smallest recoverable unit of work and the safest place to resume after interruption. P6 – Expose meaningful operational state, not internal complexity and P7 – Establish trust through inspectability mean your checkpoints must capture business-relevant progress, not just low-level traces.

Create checkpoints before and after every external side effect; store intent, input references, artifact IDs, approval state, and an idempotency key.
Classify failures into retryable, recoverable-with-steering, approval-required, and failed-closed states using P8 – Make hand-offs, approvals, and blockers explicit.
Record partial completion as a first-class output: completed items, skipped items, pending items, and assumptions.
On resume, compare current context with checkpointed context to detect drift before replay, supporting P5 – Replace implied magic with clear mental models.
Show the next action in plain language such as safe to retry, needs approval, source changed, or manual fix required.
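The checkpoint contents and the drift check from the steps above can be sketched as a data structure. Field names here are illustrative assumptions; any durable store that survives a restart would work.

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    """Durable checkpoint written before and after each external side effect.

    Captures business-relevant progress (P6), not low-level traces.
    """
    step: str                 # business-level step, e.g. "post invoice status"
    intent: str               # what the agent was about to do
    input_refs: list[str]     # references to the inputs in use at this point
    artifact_ids: list[str]   # outputs already produced
    approval_state: str       # e.g. "pre-approved", "pending", "granted"
    idempotency_key: str      # ties retries of the same write together
    committed_effects: list[str] = field(default_factory=list)

def detect_drift(checkpoint: Checkpoint, current_input_refs: list[str]) -> list[str]:
    """Compare current context with checkpointed context before replay (P5).

    Returns the input references that changed since the checkpoint was taken,
    so the run view can say "source changed" instead of silently replaying.
    """
    return sorted(set(checkpoint.input_refs) ^ set(current_input_refs))
```

An empty drift list supports "safe to retry"; a non-empty one maps to the "source changed" next action.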
Task: complete a multi-step workflow over durable checkpoints

Escalation and governance tiers

Use these tiers to separate safe recovery moves from actions that need review, following P8 – Make hand-offs, approvals, and blockers explicit.

Tier 1 — Autonomous
Replay read-only steps, refresh expired context, or resume from a checkpoint without repeating side effects
Risk level: Low
Required approval: Pre-approved at task start

Tier 2 — Guided recovery
Resume with operator-selected input, skip a reversible step, or close a partial completion note
Risk level: Medium
Required approval: Operator approval in the run view

Tier 3 — Human decision
Repeat an external write, alter business rules, discard partial work, or override a blocker
Risk level: High
Required approval: Explicit human approval for the specific action
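The tiers above can be encoded as a simple lookup so that the run surface always knows which gate applies. The action names and the `GOVERNANCE_TIERS` table are illustrative assumptions mirroring the table, not a Blueprint API.

```python
# Map each recovery move to a governance tier and its approval gate (P8).
GOVERNANCE_TIERS = {
    "replay_read_only_step":   (1, "pre-approved at task start"),
    "refresh_expired_context": (1, "pre-approved at task start"),
    "resume_with_new_input":   (2, "operator approval in the run view"),
    "skip_reversible_step":    (2, "operator approval in the run view"),
    "repeat_external_write":   (3, "explicit human approval"),
    "discard_partial_work":    (3, "explicit human approval"),
}

def required_approval(action: str) -> str:
    """Return the approval gate for a recovery action.

    Unknown actions fail closed to the strictest tier rather than
    defaulting to autonomous replay.
    """
    _tier, approval = GOVERNANCE_TIERS.get(action, (3, "explicit human approval"))
    return approval
```

Failing closed on unknown actions is the key design choice: new recovery moves start gated until someone classifies them.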

Anti-patterns vs. Blueprint patterns

Use this comparison to replace opaque recovery behavior with inspectable run design under P2 – Ensure that background work remains perceptible and P9 – Represent delegated work as a system, not merely as a conversation.

Anti-pattern: Silent background retries with no visible state change
Blueprint pattern: Named retry state with attempt count, current blocker, and human stop or resume controls

Anti-pattern: Rerun the whole workflow after any error
Blueprint pattern: Resume from the last safe checkpoint with idempotency protection around side effects

Anti-pattern: Treat partial output as total failure
Blueprint pattern: Partial completion ledger showing what finished, what is pending, and what assumptions were used

Anti-pattern: Generic error messages such as "system failed"
Blueprint pattern: Operational failure states translated into user language with the required next action

Anti-pattern: Chat transcript as the only recovery surface
Blueprint pattern: Persistent run view with checkpoints, approvals, blockers, and available recovery paths

Real-world proof

Two anonymised traces show how explicit recovery beats silent reruns.

Team used a checkpointed finance operations runner. Agent attempted to reconcile 312 invoices and post status updates. System surfaced a blocker after duplicate-write risk was detected on resume from a network timeout. Operator approved replay from the last read-only checkpoint rather than rerunning the batch. The team preserved completed records and avoided double posting.
Team used a document triage workflow with durable snapshots. Agent attempted to extract clauses from 1,400 contracts, but a retrieval dependency changed mid-run. System marked earlier outputs as partial completion and escalated because downstream summaries no longer matched the source set. Reviewer resumed with a frozen corpus. The run finished with traceable gaps instead of silent corruption.

Frequently asked questions

Common implementation questions for teams adopting explicit recovery paths in agentic systems.

What counts as a checkpoint in an agentic workflow?

A checkpoint is a durable recovery boundary that stores enough state to resume without guessing. Under P6 – Expose meaningful operational state, not internal complexity, it should reflect business progress such as items processed, approvals received, outputs created, and side effects already committed.

How often should a long-running agent checkpoint?

Checkpoint at meaningful business boundaries, especially before and after external side effects, not after every token or micro-step. P6 – Expose meaningful operational state, not internal complexity favors checkpoints that map to user-understandable progress and safe resume points.

When is auto-retry acceptable instead of escalation?

Auto-retry is appropriate when the step is reversible, side-effect free, and unlikely to change the meaning of the run if replayed. If replay could duplicate an action, alter a business outcome, or hide a blocker, P8 – Make hand-offs, approvals, and blockers explicit means the system should surface the decision instead.

How should partial completion be shown to users?

Show partial completion as a legitimate run outcome, not as a disguised failure. P9 – Represent delegated work as a system, not merely as a conversation means users should see completed items, pending items, skipped items, and the assumptions that shaped those results.
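One way to render that outcome is a small ledger-to-text helper. This is a minimal sketch assuming a flat list-based ledger; the function name and output format are illustrative.

```python
def render_partial_completion(completed, pending, skipped, assumptions):
    """Render a partial completion ledger as user-facing lines (P9).

    Presents partial completion as a legitimate run outcome: what finished,
    what is still open, what was skipped, and which assumptions applied.
    """
    lines = [
        f"Completed ({len(completed)}): " + ", ".join(completed),
        f"Pending ({len(pending)}): " + ", ".join(pending),
        f"Skipped ({len(skipped)}): " + ", ".join(skipped),
    ]
    for assumption in assumptions:
        lines.append(f"Assumption: {assumption}")
    return "\n".join(lines)
```

Because counts appear next to each category, an operator can see at a glance that a run ended partial rather than failed.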

What must appear in a recovery UI?

At minimum, show the current state, last safe checkpoint, completed work, unresolved blocker, next recommended action, and who can approve it. That combination follows P2 – Ensure that background work remains perceptible and P10 – Optimise for steering, not only initiating.

How do we prevent duplicate side effects after resume?

Use idempotency keys, write-ahead intent records, and checkpoint metadata that identifies which external effects were already committed. P7 – Establish trust through inspectability is important here because operators need evidence for why a replay is safe or why a write must stay blocked.
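A replay guard built from those pieces might look like the sketch below. The function name and the `checkpoint_committed` set are assumptions for illustration; in practice the committed-effect record would live in the durable checkpoint store.

```python
def resume_side_effect(checkpoint_committed: set, idempotency_key: str, do_write):
    """Guard an external write during resume.

    Skips effects the checkpoint already recorded as committed, so replay
    after a crash cannot duplicate an email, post, or edit. The returned
    string doubles as operator-visible evidence (P7).
    """
    if idempotency_key in checkpoint_committed:
        return "skipped: already committed"
    do_write(idempotency_key)                  # perform the external effect
    checkpoint_committed.add(idempotency_key)  # record the commit durably
    return "committed"
```

Recording the commit immediately after the write narrows, though does not fully close, the window for duplicates; a write-ahead intent record plus a server-side idempotency key closes the rest.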

Do all recovery actions need human approval?

No. P8 – Make hand-offs, approvals, and blockers explicit supports tiered governance, where low-risk replay can be pre-approved while risky writes, rule changes, and partial-work disposal require specific human approval.

Getting started checklist

Map your workflow into named run states such as running, blocked, awaiting approval, partial complete, recovered, and failed closed using P6 – Expose meaningful operational state, not internal complexity.
Insert checkpoints before and after every external side effect.
Store idempotency keys, source references, approvals, artifact IDs, and recovery metadata in each checkpoint.
Define which recovery actions are autonomous, approval-gated, or manual using P8 – Make hand-offs, approvals, and blockers explicit.
Add a run view that shows completed work, pending work, blockers, and resume options using P2 – Ensure that background work remains perceptible and P10 – Optimise for steering, not only initiating.
Next steps

Audit one existing long-running workflow and mark where it can fail, stall, or produce partial outputs. Then use P2 – Ensure that background work remains perceptible and P10 – Optimise for steering, not only initiating to add a run view and explicit recovery actions before you widen autonomy.

Run recovery drills for network timeout, tool schema change, approval delay, and context drift.
Track resume rate, failed-closed rate, duplicate-side-effect rate, and time to human hand-off.
Review broader permissions only after recovery paths are inspectable and operators can intervene with confidence under P7 – Establish trust through inspectability.
