What is Durable Execution
Durable Execution provides fault-tolerant execution for AI agents by automatically saving execution state at each pipeline step. This enables agents to recover from failures, resume from interruptions, and maintain consistency across restarts. The DurableExecution system manages:- Automatic checkpoint creation after each pipeline step
- State persistence across multiple storage backends
- Execution recovery from the point of failure
- Error tracking and debugging information
- Execution history and analytics
- Cleanup of old execution data
How Durable Execution Works
When an agent executes a task with durable execution enabled:-
Checkpoint Creation: After each successful pipeline step, the system saves a checkpoint containing:
- Task state (description, configuration, response format)
- Execution context (messages, agent state, step information)
- Current step index and name
- Execution status (running, paused, failed, completed)
-
Failure Handling: When a step fails:
- The system saves a checkpoint at the failed step with status=“failed”
- Error details are preserved in the checkpoint
- Execution metadata is updated
-
Recovery: To resume execution:
- Load the checkpoint from storage
- Reconstruct the task and execution context
- Retry the failed step (if status=“failed”) or continue from the next step
- Complete remaining pipeline steps
Checkpoint Strategy
The system uses an overwrite strategy - each execution has ONE checkpoint that is continuously updated:running: Individual step in progress or completed successfully (intermediate state)failed: Step failed with errorpaused: Execution paused (e.g., human-in-the-loop)completed: All steps finished successfully (final state)

