`AccuracyEvaluator` uses an LLM judge to compare an agent’s generated output against an expected answer. Each evaluation produces a score from 1 to 10 along with detailed reasoning and constructive critique.
How It Works
- The agent under test receives a query and produces output.
- A separate judge agent evaluates the output against the expected answer and guidelines.
- The judge returns a structured `EvaluationScore` containing a numeric score, reasoning, a pass/fail flag, and a critique.
- If `num_iterations > 1`, the process repeats and the scores are averaged.
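A minimal sketch of this flow is shown below. The import paths and the `Agent` construction are placeholders, not the library’s confirmed API; configure the judge and the agent under test however your installation requires.

```python
import asyncio

# Placeholder imports: adjust the module paths to your installation.
from your_framework.agents import Agent              # assumed location
from your_framework.evals import AccuracyEvaluator   # assumed location


async def main() -> None:
    # Agent construction is illustrative; configure models and tools as needed.
    judge = Agent(model="gpt-4o")
    candidate = Agent(model="gpt-4o-mini")

    evaluator = AccuracyEvaluator(
        judge_agent=judge,
        agent_under_test=candidate,
        query="What is 15% of 240?",
        expected_output="36",
    )

    # Runs the agent under test, then asks the judge to score its output.
    result = await evaluator.run(print_results=True)
    print(f"Average score: {result.average_score}/10")


asyncio.run(main())
```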
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `judge_agent` | `Agent` | Yes | Agent used to evaluate outputs |
| `agent_under_test` | `Agent`, `Graph`, or `Team` | Yes | Entity to evaluate |
| `query` | `str` | Yes | Input query sent to the entity |
| `expected_output` | `str` | Yes | Ground-truth answer for comparison |
| `additional_guidelines` | `str` | No | Extra criteria for the judge |
| `num_iterations` | `int` | No | Number of evaluation rounds (default: 1) |
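The two optional parameters steer the judge and reduce single-run variance. A sketch using the parameter names from the table (the query, expected answer, and guideline text are illustrative, and `judge` and `candidate` are assumed to be pre-configured `Agent` instances):

```python
evaluator = AccuracyEvaluator(
    judge_agent=judge,
    agent_under_test=candidate,
    query="Summarize the refund policy in one sentence.",
    expected_output="Customers may request a full refund within 30 days of purchase.",
    additional_guidelines="Penalize answers that invent conditions not present in the policy.",
    num_iterations=3,  # three judge rounds; the scores are averaged
)
```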
Result Structure
`AccuracyEvaluationResult` contains:

- `average_score`: Mean score across all iterations (1–10)
- `evaluation_scores`: List of `EvaluationScore` objects, one per iteration
- `generated_output`: The output produced by the entity
- `user_query` / `expected_output`: The original inputs

`EvaluationScore` includes:

- `score`: Numeric score (1–10)
- `reasoning`: Step-by-step explanation from the judge
- `is_met`: Boolean indicating whether core requirements are met
- `critique`: Actionable feedback on how to improve
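Assuming the attribute names above, a result can be inspected roughly as follows (a sketch; the formatting produced by `print_results` itself may differ):

```python
async def report(evaluator) -> None:
    result = await evaluator.run(print_results=False)

    print(f"Average score: {result.average_score:.1f}/10")
    print(f"Query:            {result.user_query}")
    print(f"Expected output:  {result.expected_output}")
    print(f"Generated output: {result.generated_output}")

    # One EvaluationScore per iteration (num_iterations entries).
    for i, es in enumerate(result.evaluation_scores, start=1):
        status = "PASS" if es.is_met else "FAIL"
        print(f"[{i}] {status}  score={es.score}/10")
        print(f"    reasoning: {es.reasoning}")
        print(f"    critique:  {es.critique}")
```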
Methods
| Method | Description |
|---|---|
| `await run(print_results=True)` | Execute the entity, then evaluate its output |
| `await run_with_output(output, print_results=True)` | Evaluate a pre-existing output string |
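`run_with_output` skips executing the agent under test and judges an output that was produced elsewhere, for example captured from logs or an earlier pipeline stage. A sketch, assuming the signature shown in the table (the 7/10 threshold is arbitrary):

```python
async def judge_captured_output(evaluator) -> None:
    # Output captured from a previous run; the agent under test is not re-executed.
    captured = "The Eiffel Tower is 330 metres tall."
    result = await evaluator.run_with_output(captured, print_results=True)

    if result.average_score < 7:
        print("Below threshold:", result.evaluation_scores[0].critique)
```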

