Overview
In the Upsonic framework, DurableExecution provides fault-tolerant execution for AI agents by automatically saving execution state at each pipeline step. This enables agents to recover from failures, resume after interruptions, and maintain consistency across restarts.
The DurableExecution system serves as a reliability layer that manages:
- Automatic checkpoint creation after each pipeline step
- State persistence across multiple storage backends
- Execution recovery from the point of failure
- Error tracking and debugging information
- Execution history and analytics
- Cleanup of old execution data
Key Concepts
How Durable Execution Works
When an agent executes a task with durable execution enabled:
1. Checkpoint Creation: After each successful pipeline step, the system saves a checkpoint containing:
- Task state (description, configuration, response format)
- Execution context (messages, agent state, step information)
- Current step index and name
- Execution status (paused, failed, completed)
2. Failure Handling: When a step fails:
- The system saves a checkpoint at the failed step with status="failed"
- Error details are preserved in the checkpoint
- Execution metadata is updated
3. Recovery: To resume execution:
- Load the checkpoint from storage
- Reconstruct the task and execution context
- Retry the failed step (if status="failed") or continue from the next step
- Complete remaining pipeline steps
Checkpoint Strategy
The system uses an overwrite strategy: each execution has ONE checkpoint that is continuously updated. The checkpoint's status field reflects the current state:
- success: Individual step completed successfully (intermediate state)
- failed: Step failed with error
- paused: Execution paused (e.g., human-in-the-loop)
- completed: All steps finished successfully (final state)
DurableExecution Attributes
Core Attributes
| Attribute | Type | Description |
|---|---|---|
| execution_id | str | Unique identifier for this execution (auto-generated) |
| storage | DurableExecutionStorage | Storage backend for checkpoint persistence |
| auto_cleanup | bool | Automatically clean up the checkpoint on completion (default: True) |
| debug | bool | Enable debug logging (default: False) |
Storage Backend Options
| Backend | Use Case | Persistence | Performance |
|---|---|---|---|
| InMemoryStorage | Testing, temporary executions | No | Fastest |
| FileStorage | Development, single-node systems | Yes | Fast |
| SQLiteStorage | Small to medium applications | Yes | Fast |
| RedisStorage | Distributed, high-scale systems | Yes | Very Fast |
Creating Durable Executions
Durable executions are created by attaching a DurableExecution instance to a Task. The system automatically handles checkpoint management throughout the execution lifecycle.
Basic Usage
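A minimal sketch of wiring a DurableExecution into a task. Agent, Task, and agent.do() follow Upsonic's public API; the DurableExecution import path, the storage module path, and the durable_execution parameter on Task are assumptions based on the attribute tables above, so verify them against the framework before use.
```python
from upsonic import Agent, Task

# Import paths below are assumptions; check the actual module layout.
from upsonic import DurableExecution
from upsonic.storage import SQLiteStorage

# Persist checkpoints in a local SQLite database.
storage = SQLiteStorage(db_path="./durable_executions.db")

# One DurableExecution per task run; execution_id is auto-generated.
durable = DurableExecution(storage=storage, auto_cleanup=True, debug=False)

task = Task(
    "Summarize the quarterly sales report",
    durable_execution=durable,  # parameter name assumed, not confirmed API
)

agent = Agent(model="openai/gpt-4o")
result = agent.do(task)  # a checkpoint is written after each pipeline step

print(result)
print("Execution ID:", durable.execution_id)  # keep this to resume later
```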
Recovery from Failure
Storage Backends
In-Memory Storage
Fast, non-persistent storage for testing and development.
- ⚡ Fastest performance
- ❌ No persistence (lost on restart)
- ✅ Perfect for testing
- ✅ No setup required
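A short sketch for test runs, assuming the InMemoryStorage class name from the backend table above and an importable DurableExecution; both import paths are guesses.
```python
from upsonic import DurableExecution          # import path assumed
from upsonic.storage import InMemoryStorage   # class and import path assumed

# Checkpoints live only in process memory; nothing survives a restart.
durable = DurableExecution(storage=InMemoryStorage(), debug=True)
```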
File-Based Storage
Simple, human-readable JSON files for single-node applications.
| Parameter | Type | Description | Default |
|---|---|---|---|
| path | str | Directory path for checkpoint files | "./durable_states" |
- 📁 Human-readable JSON format
- ✅ Easy debugging and inspection
- ✅ Simple backup and restore
- ⚠️ Not suitable for distributed systems
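A sketch using the path parameter from the table above; the class name and import paths are assumptions.
```python
from upsonic import DurableExecution      # import path assumed
from upsonic.storage import FileStorage   # class and import path assumed

# Each checkpoint is written as a human-readable JSON file in this directory.
storage = FileStorage(path="./durable_states")
durable = DurableExecution(storage=storage)
```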
SQLite Storage
Queryable, efficient storage for small to medium applications.
| Parameter | Type | Description | Default |
|---|---|---|---|
| db_path | str | SQLite database file path | "./durable_executions.db" |
| table_name | str | Table name for checkpoints | "durable_executions" |
- 🗄️ Single-file database
- ✅ ACID transactions
- ✅ Queryable execution history
- ✅ Efficient for thousands of executions
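A sketch using the db_path and table_name parameters from the table above; the class name and import paths are assumptions.
```python
from upsonic import DurableExecution        # import path assumed
from upsonic.storage import SQLiteStorage   # class and import path assumed

# Checkpoints are rows in a single-file database, so history stays queryable.
storage = SQLiteStorage(
    db_path="./durable_executions.db",
    table_name="durable_executions",
)
durable = DurableExecution(storage=storage)
```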
Redis Storage
Distributed, high-performance storage for production systems.
| Parameter | Type | Description | Default |
|---|---|---|---|
| host | str | Redis server host | "localhost" |
| port | int | Redis server port | 6379 |
| db | int | Redis database number | 0 |
| password | Optional[str] | Redis password | None |
| prefix | str | Key prefix | "durable:state:" |
| ttl | Optional[int] | Time-to-live in seconds | None (no expiration) |
- ⚡ Very fast in-memory performance
- ✅ Distributed architecture support
- ✅ Built-in TTL for automatic cleanup
- ✅ Scalable to millions of executions
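A sketch using the connection parameters from the table above; the class name, import paths, and hostname are assumptions.
```python
from upsonic import DurableExecution       # import path assumed
from upsonic.storage import RedisStorage   # class and import path assumed

# A shared Redis instance lets any worker in the cluster resume an execution.
storage = RedisStorage(
    host="redis.internal",   # hypothetical hostname
    port=6379,
    db=0,
    password=None,
    prefix="durable:state:",
    ttl=7 * 24 * 3600,       # expire checkpoints after one week
)
durable = DurableExecution(storage=storage)
```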
Recovery and Continuation
Basic Recovery
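A sketch of resuming an interrupted execution. The recovery flow (load the checkpoint, rebuild the context, continue) comes from the Key Concepts section above, but DurableExecution.recover and the durable_execution parameter are hypothetical names used only for illustration.
```python
from upsonic import Agent, Task, DurableExecution   # import paths assumed
from upsonic.storage import SQLiteStorage           # class and import path assumed

storage = SQLiteStorage(db_path="./durable_executions.db")

# Hypothetical helper: load the checkpoint saved under this execution_id.
durable = DurableExecution.recover("exec-1234", storage=storage)

# Re-running the task continues from the checkpointed step instead of step 0.
task = Task(
    "Summarize the quarterly sales report",
    durable_execution=durable,   # parameter name assumed
)
result = Agent(model="openai/gpt-4o").do(task)
```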
Recovery with Debug Mode
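The same flow with debug logging enabled; debug is a documented attribute, while the recover helper remains hypothetical.
```python
# debug=True logs checkpoint loads and step transitions while resuming;
# storage is the backend instance from the previous example.
durable = DurableExecution.recover(   # hypothetical helper, as above
    "exec-1234",
    storage=storage,
    debug=True,
)
```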
Handling Different Failure Scenarios
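A sketch that branches on the status values listed under Checkpoint Strategy; the load method and the checkpoint field names are illustrative, not confirmed API.
```python
execution_id = "exec-1234"
checkpoint = storage.load(execution_id)   # method name assumed; storage as above

if checkpoint is None:
    print("No checkpoint found; start a fresh execution.")
elif checkpoint["status"] == "failed":
    # Resume via the recovery flow shown above: the failed step is retried first.
    print(f"Retrying failed step: {checkpoint['step_name']}")
elif checkpoint["status"] == "paused":
    print("Execution is waiting on human input before resuming.")
elif checkpoint["status"] == "completed":
    print("Nothing to do; execution already finished.")
```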
Execution Management
Listing Executions
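A sketch of iterating stored checkpoints; the list_executions method and the field names are illustrative assumptions.
```python
# Method and field names are assumptions for illustration; storage as above.
for execution in storage.list_executions():
    print(execution["execution_id"], execution["status"], execution["current_step"])
```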
Getting Execution Information
Storage Statistics
Cleanup Operations
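A sketch of manual cleanup. With auto_cleanup=True, completed executions are removed automatically, so these hypothetical calls target leftovers such as old failed runs.
```python
# Hypothetical cleanup calls; method names are illustrative only.
storage.delete(execution_id)          # drop a single checkpoint
storage.cleanup_older_than(days=30)   # prune executions older than 30 days
```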
Practical Examples
Banking Transaction with Recovery
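A sketch of the banking scenario under the same assumptions as earlier (import paths, the durable_execution parameter, and the recover helper are illustrative, not confirmed API); the benefits below describe the intended behavior.
```python
from upsonic import Agent, Task, DurableExecution   # import paths assumed
from upsonic.storage import RedisStorage            # class and import path assumed

storage = RedisStorage(host="localhost", port=6379)
durable = DurableExecution(storage=storage, auto_cleanup=False)

transfer = Task(
    "Transfer $250 from account A-1001 to account B-2002 and record the receipt",
    durable_execution=durable,   # parameter name assumed
)

agent = Agent(model="openai/gpt-4o")
try:
    agent.do(transfer)
except ConnectionError:
    # The failure leaves a checkpoint with status="failed"; resuming continues
    # from that exact step, so the transfer is not applied twice.
    transfer.durable_execution = DurableExecution.recover(   # hypothetical helper
        durable.execution_id, storage=storage
    )
    agent.do(transfer)
```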
- ✅ No duplicate charges (transaction resumes from exact point)
- ✅ Complete audit trail of all steps
- ✅ Automatic recovery from network failures
- ✅ Maintains transaction consistency
Long-Running Data Processing
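A sketch of a long-running job with Redis-backed checkpoints, under the same naming assumptions as above.
```python
from upsonic import Agent, Task, DurableExecution   # import paths assumed
from upsonic.storage import RedisStorage            # class and import path assumed

# A shared Redis instance lets another worker pick up the job if this one dies.
durable = DurableExecution(storage=RedisStorage(host="localhost"), debug=True)

task = Task(
    "Deduplicate and enrich the customer_events dataset, then write a summary",
    durable_execution=durable,   # parameter name assumed
)

Agent(model="openai/gpt-4o").do(task)
# If the process is killed midway, resuming the same execution_id continues
# from the last completed step instead of restarting from zero.
```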
- ✅ Process millions of records safely
- ✅ Resume from interruption point (not restart from zero)
- ✅ Distributed processing support via Redis
- ✅ Progress tracking and monitoring
Multi-Step Workflow with Tools
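A sketch of a tool-using task with file-based checkpoints; the tool is a toy stand-in, and the durable_execution parameter and import paths remain assumptions.
```python
from upsonic import Agent, Task, DurableExecution   # import paths assumed
from upsonic.storage import FileStorage             # class and import path assumed


def fetch_exchange_rate(currency: str) -> float:
    """Toy tool standing in for a real rate-lookup API."""
    return {"EUR": 1.08, "GBP": 1.27}.get(currency, 1.0)


durable = DurableExecution(storage=FileStorage(path="./durable_states"))

task = Task(
    "Convert the invoice totals to USD and draft a payment summary",
    tools=[fetch_exchange_rate],
    durable_execution=durable,   # parameter name assumed
)

Agent(model="openai/gpt-4o").do(task)
# Tool results are part of the saved execution context, so a crash after the
# rate lookup resumes without calling the tool again.
```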
- ✅ Complex multi-step workflows with tool calls
- ✅ Each step is checkpointed
- ✅ Tool results are preserved
- ✅ Resume from exact tool call
Monitoring and Analytics
Execution Recovery Dashboard
Best Practices
Storage Selection
- Development: Use FileDurableStorage for easy debugging
- Testing: Use InMemoryDurableStorage for fast tests
- Production (Single Node): Use SQLiteDurableStorage for reliability
- Production (Distributed): Use RedisDurableStorage for scalability
Cleanup Strategy
Error Handling
Monitoring and Alerting
Performance Optimization
Complete Example
Integration Patterns
With Agent Pipelines
Summary
Durable Execution in Upsonic provides enterprise-grade reliability for AI agents:
✅ Automatic Checkpointing: Every step is saved automatically
✅ Multiple Storage Backends: File, SQLite, Redis, In-Memory
✅ Seamless Recovery: Resume from exact failure point
✅ No Data Loss: Complete state preservation
✅ Production Ready: Tested across all storage backends
✅ Distributed Support: Redis-based coordination
✅ Easy Integration: Simple API, minimal code changes
Perfect for mission-critical applications like banking transactions, long-running data processing, multi-step workflows, and distributed agent systems.

