---
name: long-running-task-checkpointer
description: Wrap an agent loop so every step snapshots state to disk, letting a crashed agent resume from the last good state instead of restarting from scratch.
title: Long-Running Task Checkpointer
category: agents-workflows
difficulty: advanced
icon: 💾
input: structured-data
output: structured-json
phase: post
domain: ops
tags: checkpointing,fault-tolerance,agent-recovery,state-persistence,workflow-resilience,middleware,crash-recovery,snapshot,replay,long-running-tasks,agent-frameworks,distributed-systems
best_for:
  - Multi-hour agentic tasks with high failure risk
  - Production agent deployments requiring uptime
  - Research agents processing large datasets
  - Cost-sensitive environments where restart overhead is prohibitive
---

## Description

A middleware layer for agent frameworks. After each tool call, it writes a compact state snapshot (conversation history, working variables, tool output hashes) to disk. On restart, it replays the snapshot and resumes from the next step. Designed for multi-hour agentic tasks that currently lose all progress on a crash or timeout.

## Why it works

Agent frameworks keep state in memory by default, so any crash — OOM, preemption, timeout, power loss — wastes everything done so far. Most agent work is deterministic enough that replay works; adding a checkpoint per step lets you recover cheaply without changing the agent's logic.

## How it works

1. Wrap the agent's step function: before returning, serialize (messages, state, last_tool_output_hash) into a numbered checkpoint file. 2. Each checkpoint includes a 'valid' flag that only flips on after the tool completes cleanly. 3. On resume, load the highest-numbered valid checkpoint and feed its state back into the agent. 4. Optionally support checkpoint pruning — keep the last N or the last per day. 5. Expose a CLI to inspect, replay, or fork from any checkpoint.