The ReliabilityEvaluator is a post-execution assertion engine that verifies an agent’s tool-calling behavior. It checks whether the expected tools were invoked, in the correct order if required, and flags any unexpected tool calls.

How It Works

  1. Run your agent, team, or graph to completion.
  2. Pass the completed result (a Task, List[Task], or Graph) to the evaluator.
  3. The evaluator extracts tool call history and compares it against the expected list.
  4. Returns a structured result with pass/fail status, per-tool checks, and missing/unexpected lists.
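
The four steps above can be sketched end to end. Everything below is an illustrative stand-in, not the library's source: the toy `Task` class and the evaluator body are assumptions, and only the documented names (`expected_tool_calls`, `order_matters`, `exact_match`, `run`, `passed`, `missing_tool_calls`, `unexpected_tool_calls`, `actual_tool_calls`) come from this page.

```python
class Task:
    """Toy stand-in: a completed task carrying its recorded tool-call history."""
    def __init__(self, tool_calls):
        self.tool_calls = tool_calls  # ordered tool names, e.g. ["search_web", "send_email"]

class ReliabilityEvaluator:
    """Sketch of the post-execution assertion logic described above (assumed, not actual source)."""
    def __init__(self, expected_tool_calls, order_matters=False, exact_match=False):
        self.expected_tool_calls = expected_tool_calls
        self.order_matters = order_matters
        self.exact_match = exact_match

    def run(self, task):
        actual = list(task.tool_calls)
        # Expected tools that never appeared in the history
        missing = [t for t in self.expected_tool_calls if t not in actual]
        # Extra tools only count against the result when exact_match=True
        unexpected = ([t for t in actual if t not in self.expected_tool_calls]
                      if self.exact_match else [])
        in_order = True
        if self.order_matters:
            # Check the expected list appears as a subsequence of the actual calls
            it = iter(actual)
            in_order = all(t in it for t in self.expected_tool_calls)
        passed = not missing and not unexpected and in_order
        return {"passed": passed, "actual_tool_calls": actual,
                "missing_tool_calls": missing, "unexpected_tool_calls": unexpected}

# Steps 1-2: run the agent (simulated here), then hand the completed task to the evaluator.
task = Task(tool_calls=["search_web", "send_email"])
result = ReliabilityEvaluator(["search_web", "send_email"], order_matters=True).run(task)
print(result["passed"])  # True: both tools called, in the expected order
```

Note the subsequence check: with `order_matters=True`, extra calls interleaved between expected ones still pass as long as the expected tools appear in the required relative order.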

Parameters

Parameter           | Type      | Required | Description
--------------------|-----------|----------|------------------------------------------------------------
expected_tool_calls | List[str] | Yes      | Tool names that should have been called
order_matters       | bool      | No       | Whether call order must match (default: False)
exact_match         | bool      | No       | Whether only expected tools may be called (default: False)

Input Types

The run() method accepts:
Input      | Source
-----------|--------------------------------------------------------
Task       | Result of Agent.do() / Agent.do_async()
List[Task] | Result of Team.multi_agent_async()
Graph      | A Graph instance after graph.run() / graph.run_async()
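
Accepting three input shapes implies a normalization step before comparison. The sketch below shows one way the evaluator could flatten them into a single ordered history; the `Task`/`Graph` stand-ins, the `.tool_calls` and `.tasks` attributes, and the `isinstance` dispatch are assumptions for illustration, not the library's actual API.

```python
class Task:
    def __init__(self, tool_calls):
        self.tool_calls = tool_calls  # ordered tool names recorded during the run

class Graph:
    def __init__(self, tasks):
        self.tasks = tasks  # completed Tasks, in execution order

def extract_tool_calls(result):
    """Normalize Task, List[Task], or Graph into one ordered list of tool names."""
    if isinstance(result, Task):
        return list(result.tool_calls)
    if isinstance(result, Graph):
        return [name for task in result.tasks for name in task.tool_calls]
    if isinstance(result, list):  # List[Task], e.g. from a multi-agent team run
        return [name for task in result for name in task.tool_calls]
    raise TypeError(f"Unsupported input type: {type(result).__name__}")

# All three shapes reduce to the same kind of flat history:
history = extract_tool_calls(Graph([Task(["search"]), Task(["summarize"])]))
```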

Result Structure

ReliabilityEvaluationResult contains:
  • passed — Overall pass/fail boolean
  • summary — Human-readable explanation
  • expected_tool_calls — The original expected list
  • actual_tool_calls — Ordered list of tools actually called
  • checks — List of ToolCallCheck objects (one per expected tool)
  • missing_tool_calls — Expected tools that were not invoked
  • unexpected_tool_calls — Tools called but not expected (only when exact_match=True)

Each ToolCallCheck includes:
  • tool_name — Name of the tool
  • was_called — Whether the tool was found in history
  • times_called — How many times it was invoked
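
The documented result shape can be mirrored as dataclasses. This is a sketch of the fields listed above, not the library's source, and the construction of per-tool checks at the bottom is an assumed illustration of the "one per expected tool" rule.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ToolCallCheck:
    tool_name: str     # name of the expected tool
    was_called: bool   # found anywhere in the call history?
    times_called: int  # number of invocations

@dataclass
class ReliabilityEvaluationResult:
    passed: bool                      # overall pass/fail
    summary: str                      # human-readable explanation
    expected_tool_calls: List[str]    # the original expected list
    actual_tool_calls: List[str]      # ordered, as actually invoked
    checks: List[ToolCallCheck]       # one per expected tool
    missing_tool_calls: List[str]     # expected but never invoked
    unexpected_tool_calls: List[str]  # populated only when exact_match=True

# Building per-tool checks from a call history, one check per expected tool:
actual = ["search_web", "search_web", "send_email"]
checks = [ToolCallCheck(name, name in actual, actual.count(name))
          for name in ["search_web", "send_email", "save_file"]]
# checks[2] reports save_file as not called, with a count of zero
```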

Usage Examples