The AccuracyEvaluator uses an LLM judge to compare an agent’s generated output against an expected answer. Each evaluation produces a score from 1 to 10 along with detailed reasoning and constructive critique.

How It Works

  1. The agent under test receives a query and produces output.
  2. A separate judge agent evaluates the output against the expected answer and guidelines.
  3. The judge returns a structured EvaluationScore containing a numeric score, reasoning, pass/fail flag, and critique.
  4. If num_iterations > 1, the process repeats and the scores are averaged (see the sketch below).
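To make the flow concrete, here is a minimal, illustrative Python sketch of the loop described above. It is not the library's internal implementation; the two stand-in functions for the agent under test and the judge are placeholders.

```python
from dataclasses import dataclass

@dataclass
class EvaluationScore:
    score: int      # 1-10
    reasoning: str
    is_met: bool
    critique: str

def run_agent_under_test(query: str) -> str:
    # Placeholder for the real agent call (step 1).
    return "generated output"

def run_judge(output: str, expected: str, guidelines: str) -> EvaluationScore:
    # Placeholder for the judge agent comparing the output
    # against the expected answer and guidelines (steps 2-3).
    return EvaluationScore(score=8, reasoning="...", is_met=True, critique="...")

def evaluate(query: str, expected: str, guidelines: str = "",
             num_iterations: int = 1) -> float:
    scores = [
        run_judge(run_agent_under_test(query), expected, guidelines)
        for _ in range(num_iterations)  # step 4: repeat if requested
    ]
    return sum(s.score for s in scores) / len(scores)  # averaged score
```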

Parameters

  • judge_agent (Agent, required) — Agent used to evaluate outputs
  • agent_under_test (Agent | Graph | Team, required) — Entity to evaluate
  • query (str, required) — Input query sent to the entity
  • expected_output (str, required) — Ground-truth answer for comparison
  • additional_guidelines (str, optional) — Extra criteria for the judge
  • num_iterations (int, optional) — Number of evaluation rounds (default: 1)
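A minimal construction sketch using the parameters listed above. The judge_agent and my_agent instances are assumed to be pre-built Agent objects, and the import path is omitted because it is not part of this table.

```python
# judge_agent and my_agent are assumed Agent instances created elsewhere
# with your framework's own constructor (not shown here).
evaluator = AccuracyEvaluator(
    judge_agent=judge_agent,                 # agent that scores outputs
    agent_under_test=my_agent,               # Agent, Graph, or Team to evaluate
    query="What is the capital of France?",
    expected_output="Paris",
    additional_guidelines="Answer must name the city only.",  # optional
    num_iterations=3,                        # optional, defaults to 1
)
```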

Result Structure

AccuracyEvaluationResult contains:
  • average_score — Mean score across all iterations (1–10)
  • evaluation_scores — List of EvaluationScore objects, one per iteration
  • generated_output — The output produced by the entity
  • user_query / expected_output — The original inputs

Each EvaluationScore includes:
  • score — Numeric score (1–10)
  • reasoning — Step-by-step explanation from the judge
  • is_met — Boolean indicating whether core requirements are met
  • critique — Actionable feedback on how to improve
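A sketch of reading these fields from a result. Here `result` is assumed to be the AccuracyEvaluationResult returned by run() (see Methods below); only the field names listed above are used.

```python
print(f"Average score: {result.average_score:.1f}/10")
print(f"Query:  {result.user_query}")
print(f"Output: {result.generated_output}")

for i, es in enumerate(result.evaluation_scores, start=1):
    status = "met" if es.is_met else "not met"
    print(f"Iteration {i}: {es.score}/10 ({status})")
    print(f"  Reasoning: {es.reasoning}")
    print(f"  Critique:  {es.critique}")
```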

Methods

  • await run(print_results=True) — Execute the entity under test, then evaluate its output
  • await run_with_output(output, print_results=True) — Evaluate a pre-existing output string without re-running the entity

Usage Examples
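
An end-to-end sketch combining the two methods above. The agent setup is assumed and omitted; only the constructor parameters, the run() / run_with_output() signatures, and the result fields shown earlier are taken from this page.

```python
import asyncio

async def main() -> None:
    # judge_agent and my_agent are assumed Agent instances built elsewhere.
    evaluator = AccuracyEvaluator(
        judge_agent=judge_agent,
        agent_under_test=my_agent,
        query="Summarize the water cycle in one sentence.",
        expected_output="Water evaporates, condenses into clouds, and returns as precipitation.",
    )

    # Run the agent under test, then have the judge score its output.
    result = await evaluator.run(print_results=True)
    print(result.average_score)

    # Evaluate an output produced earlier, without re-running the agent.
    cached = "Water evaporates, forms clouds, and falls back as rain or snow."
    await evaluator.run_with_output(output=cached, print_results=True)

asyncio.run(main())
```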