ReliabilityEvaluator is a post-execution assertion engine that verifies an agent’s tool-calling behavior. It checks whether the expected tools were invoked, in the correct order if required, and flags any unexpected tool calls.
How It Works
- Run your agent, team, or graph to completion.
- Pass the completed result (a `Task`, `List[Task]`, or `Graph`) to the evaluator.
- The evaluator extracts the tool call history and compares it against the expected list.
- The evaluator returns a structured result with the pass/fail status, per-tool checks, and lists of missing and unexpected tool calls.
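The comparison step can be sketched in plain Python. This is a hypothetical illustration, not the evaluator's actual source. It makes several assumptions:

- The tool call history has already been extracted into an ordered list of tool names.
- Ordering is ignored, so `order_matters` plays no part.
- The function name `compare_tool_calls` is invented here.

```python
from collections import Counter


def compare_tool_calls(expected, actual, exact_match=False):
    """Hypothetical comparison of expected vs. actual tool names.

    Returns (passed, missing, unexpected). Call order is not checked here.
    """
    called = Counter(actual)  # tool name -> number of invocations
    # Expected tools that never appear in the history.
    missing = [name for name in expected if called[name] == 0]
    # Extra tools are only reported when exact_match is requested.
    unexpected = []
    if exact_match:
        expected_set = set(expected)
        # dict.fromkeys removes duplicates while keeping first-seen order.
        unexpected = [name for name in dict.fromkeys(actual) if name not in expected_set]
    passed = not missing and not unexpected
    return passed, missing, unexpected


history = ["search", "summarize", "send_email"]
print(compare_tool_calls(["search", "summarize"], history))
print(compare_tool_calls(["search", "summarize"], history, exact_match=True))
```

Without `exact_match`, the extra `send_email` call is tolerated. With it, the same history fails and `send_email` is reported as unexpected.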
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `expected_tool_calls` | `List[str]` | Yes | Tool names that should have been called |
| `order_matters` | `bool` | No | Whether the tools must be called in the order listed (default: `False`) |
| `exact_match` | `bool` | No | Whether only the expected tools may be called (default: `False`) |
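One plausible reading of `order_matters=True` is that the expected tools must appear in the actual history as an ordered subsequence, with other calls allowed in between. The sketch below assumes that interpretation. The helper `called_in_order` is hypothetical, and the real evaluator may apply a stricter rule.

```python
def called_in_order(expected, actual):
    """Return True if `expected` appears in `actual` as an ordered subsequence.

    Assumption: unrelated tool calls may be interleaved between expected ones.
    """
    remaining = iter(actual)
    # `name in remaining` consumes the iterator up to and including the match,
    # so each expected tool must be found after the previous one.
    return all(name in remaining for name in expected)


print(called_in_order(["search", "summarize"], ["search", "fetch", "summarize"]))  # True
print(called_in_order(["search", "summarize"], ["summarize", "search"]))  # False
```

Sharing one iterator across the `in` checks keeps the scan linear in the length of the history.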
Input Types
The `run()` method accepts:
| Input | Source |
|---|---|
| `Task` | Result of `Agent.do()` / `Agent.do_async()` |
| `List[Task]` | Result of `Team.multi_agent_async()` |
| `Graph` | A `Graph` instance after `graph.run()` / `graph.run_async()` |
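All three input types reduce to the same thing: an ordered list of tool names. The sketch below shows that normalization using hypothetical stand-in classes (`FakeTask`, `FakeGraph`) with an assumed `tool_calls` attribute. The real `Task` and `Graph` objects may store their history differently, and concatenating task histories in list order is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class FakeTask:
    # Hypothetical stand-in for a completed Task; assumes it records
    # the names of the tools it invoked, in call order.
    tool_calls: List[str] = field(default_factory=list)


@dataclass
class FakeGraph:
    # Hypothetical stand-in for a Graph whose nodes are completed tasks.
    tasks: List[FakeTask] = field(default_factory=list)


def extract_tool_history(result: Union[FakeTask, List[FakeTask], FakeGraph]) -> List[str]:
    """Flatten any supported input into one ordered list of tool names."""
    if isinstance(result, FakeGraph):
        tasks = result.tasks
    elif isinstance(result, FakeTask):
        tasks = [result]
    else:
        tasks = list(result)
    history: List[str] = []
    for task in tasks:
        history.extend(task.tool_calls)
    return history


team_result = [FakeTask(["search"]), FakeTask(["summarize", "send_email"])]
print(extract_tool_history(team_result))  # ['search', 'summarize', 'send_email']
```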
Result Structure
`ReliabilityEvaluationResult` contains:

- `passed`: Overall pass/fail boolean
- `summary`: Human-readable explanation
- `expected_tool_calls`: The original expected list
- `actual_tool_calls`: Ordered list of tools actually called
- `checks`: List of `ToolCallCheck` objects (one per expected tool)
- `missing_tool_calls`: Expected tools that were not invoked
- `unexpected_tool_calls`: Tools called but not expected (only when `exact_match=True`)
`ToolCallCheck` includes:

- `tool_name`: Name of the tool
- `was_called`: Whether the tool was found in the history
- `times_called`: How many times it was invoked
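The relationship between `checks` and `missing_tool_calls` can be sketched with a small dataclass whose fields mirror `ToolCallCheck`. This is an illustration only. `build_checks` is a hypothetical helper, and it assumes `times_called` counts every invocation in the flattened history.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCallCheck:
    # Field names follow the documented ToolCallCheck; this class is a local mirror.
    tool_name: str
    was_called: bool
    times_called: int


def build_checks(expected: List[str], actual: List[str]) -> List[ToolCallCheck]:
    """Build one check per expected tool, counting its invocations in the history."""
    counts = Counter(actual)  # missing keys read as 0
    return [
        ToolCallCheck(tool_name=name, was_called=counts[name] > 0, times_called=counts[name])
        for name in expected
    ]


checks = build_checks(["search", "summarize"], ["search", "search", "fetch"])
# Missing tools are exactly the checks whose tool was never called.
missing = [c.tool_name for c in checks if not c.was_called]
print(missing)  # ['summarize']
```

Under this sketch, `missing_tool_calls` is derivable from `checks`, while `unexpected_tool_calls` (here, `fetch`) needs the full history because no check exists for tools outside the expected list.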

