Evaluation Data Models

This page documents all data models used in the Upsonic evaluation system.

EvaluationScore

Represents the structured judgment from the LLM-as-a-judge for a single evaluation of an agent’s response.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score | float | Required | A numerical score, on a scale of 1-10, representing the quality and accuracy of the generated response when compared to the expected output and guidelines |
| reasoning | str | Required | The detailed, step-by-step inner monologue of the judge, explaining exactly why the given score was assigned. This should reference the query, expected output, and guidelines |
| is_met | bool | Required | A definitive boolean flag indicating if the generated output successfully meets the core requirements and spirit of the expected output |
| critique | str | Required | Constructive, actionable feedback on how the agent’s response could have been improved. If the response was perfect, this can state that no improvements are needed |

Validation Rules

  • score: Must be between 1 and 10 (inclusive)
  • All fields are required
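
The 1-10 bound maps naturally onto field validation. The sketch below assumes a Pydantic v2 model; EvaluationScoreSketch and its construction are illustrative, not Upsonic’s actual class definition.

from pydantic import BaseModel, Field

class EvaluationScoreSketch(BaseModel):
    # Hypothetical stand-in for EvaluationScore; field names mirror the table above
    score: float = Field(..., ge=1, le=10)  # must be between 1 and 10, inclusive
    reasoning: str
    is_met: bool
    critique: str

# Construction raises a validation error if score falls outside 1-10
judgment = EvaluationScoreSketch(
    score=9.0,
    reasoning="The answer matches the expected output and follows the guidelines.",
    is_met=True,
    critique="No improvements needed.",
)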

PerformanceRunResult

Captures the raw performance metrics from a single execution run of an agent, graph, or team.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| latency_seconds | float | Required | The total wall-clock time taken for the execution, measured in high-precision seconds |
| memory_increase_bytes | int | Required | The net increase in memory allocated by Python objects specifically during this run, measured in bytes. This isolates the memory cost of the operation |
| memory_peak_bytes | int | Required | The peak memory usage recorded at any point during this specific run, relative to the start of the run, measured in bytes |
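
Metrics of this shape can be captured with the standard library. The helper below is a hedged sketch: measure_run and its use of time.perf_counter and tracemalloc are illustrative, not necessarily how Upsonic measures these values internally.

import time
import tracemalloc

def measure_run(fn):
    # Illustrative helper: collects metrics in the same shape as PerformanceRunResult
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    start = time.perf_counter()

    fn()  # the agent, graph, or team execution being measured

    latency_seconds = time.perf_counter() - start
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return PerformanceRunResult(
        latency_seconds=latency_seconds,
        memory_increase_bytes=current - baseline,  # net allocation during the run
        memory_peak_bytes=peak - baseline,         # peak relative to the start of the run
    )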

ToolCallCheck

Represents the verification result for a single expected tool call.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tool_name | str | Required | The name of the tool that was being checked for |
| was_called | bool | Required | A boolean flag that is True if the tool was found in the execution history, otherwise False |
| times_called | int | Required | The total number of times this specific tool was called during the run |
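
For illustration, a single check can be derived from the list of tool names observed in an execution history (the history below is made-up data):

# Hypothetical execution history of tool names, in call order
history = ["search", "search", "summarize"]

check = ToolCallCheck(
    tool_name="search",
    was_called="search" in history,
    times_called=history.count("search"),
)
# check.was_called == True, check.times_called == 2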

AccuracyEvaluationResult

The final, aggregated result of an accuracy evaluation. This object is returned to the user and contains all inputs, outputs, and the comprehensive judgments from the evaluation process.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| evaluation_scores | List[EvaluationScore] | Required | A list containing the detailed EvaluationScore object from each iteration of the test |
| average_score | float | Required | The calculated average score from all evaluation iterations |
| user_query | str | Required | The original input query that was provided to the agent under test |
| expected_output | str | Required | The ‘gold-standard’ or ground-truth answer that was used as a benchmark for the evaluation |
| generated_output | str | Required | The final output that was actually produced by the agent, graph, or team under test |

Configuration

  • from_attributes = True: Allows creation from object attributes
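
Assuming the model is a Pydantic v2 model, from_attributes = True lets it be validated directly from any object exposing matching attributes, not only from a dict. A minimal sketch, where SimpleNamespace stands in for whatever internal result object the evaluator produces:

from types import SimpleNamespace

raw = SimpleNamespace(
    evaluation_scores=[],
    average_score=8.5,
    user_query="What is the capital of France?",
    expected_output="Paris",
    generated_output="The capital of France is Paris.",
)

# model_validate reads the matching attribute names from the object
result = AccuracyEvaluationResult.model_validate(raw)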

PerformanceEvaluationResult

The final, aggregated report of a performance evaluation. It provides meaningful statistics that reveal the stability and characteristics of an agent’s performance.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| all_runs | List[PerformanceRunResult] | Required | A list containing the raw PerformanceRunResult object from every measured iteration |
| num_iterations | int | Required | The number of measurement runs that were performed |
| warmup_runs | int | Required | The number of warmup runs that were performed before measurements began |
| latency_stats | Dict[str, float] | Required | A dictionary of key statistical measures for latency (in seconds), including ‘average’, ‘median’, ‘min’, ‘max’, and ‘std_dev’ |
| memory_increase_stats | Dict[str, float] | Required | A dictionary of statistical measures for the net memory increase (in bytes), including ‘average’, ‘median’, ‘min’, ‘max’, and ‘std_dev’ |
| memory_peak_stats | Dict[str, float] | Required | A dictionary of statistical measures for the peak memory usage (in bytes), including ‘average’, ‘median’, ‘min’, ‘max’, and ‘std_dev’ |
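
For illustration, the three *_stats dictionaries can be produced from the raw runs with the standard statistics module (summarize and run_results are hypothetical names, not part of the library):

import statistics

def summarize(values):
    # Produces the same keys the *_stats dictionaries use
    return {
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "std_dev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

latency_stats = summarize([run.latency_seconds for run in run_results])
memory_increase_stats = summarize([run.memory_increase_bytes for run in run_results])
memory_peak_stats = summarize([run.memory_peak_bytes for run in run_results])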

Configuration

  • from_attributes = True: Allows creation from object attributes

ReliabilityEvaluationResult

The final, comprehensive report of a reliability evaluation. It contains the overall pass/fail status, diagnostic tool call lists, and detailed per-tool checks, and is designed for integration into automated test suites.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| passed | bool | Required | The overall pass/fail status of the entire reliability check |
| summary | str | Required | A human-readable summary explaining the final outcome of the evaluation |
| expected_tool_calls | List[str] | Required | The original list of tool names that the user expected to be called |
| actual_tool_calls | List[str] | Required | The complete, ordered list of tool names that were actually called during the execution |
| checks | List[ToolCallCheck] | Required | A detailed list of the check results for each individual expected tool |
| missing_tool_calls | List[str] | Required | A convenience list containing the names of expected tools that were not found in the actual tool calls |
| unexpected_tool_calls | List[str] | Required | A list of tools that were called but were not in the expected list. This is only populated if the exact_match setting was used |
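
For illustration, the two convenience lists can be derived from the expected and actual call lists roughly as follows (a sketch, not necessarily the library’s exact logic):

expected_tool_calls = ["search", "analyze"]
actual_tool_calls = ["search", "summarize"]

missing_tool_calls = [name for name in expected_tool_calls if name not in actual_tool_calls]
unexpected_tool_calls = [name for name in actual_tool_calls if name not in expected_tool_calls]
# missing_tool_calls == ["analyze"], unexpected_tool_calls == ["summarize"]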

Functions

assert_passed

Raises an AssertionError if the evaluation did not pass. This method allows seamless integration into testing frameworks such as pytest: if the passed attribute is False, an informative error is raised.

Raises:
  • AssertionError: If the evaluation did not pass, with a summary of the failure
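
In a pytest suite this typically looks like the sketch below, where run_reliability_eval is a hypothetical placeholder for however you obtain a ReliabilityEvaluationResult:

def test_agent_calls_required_tools():
    result = run_reliability_eval()  # hypothetical helper returning a ReliabilityEvaluationResult
    # Fails the test with the evaluation summary if any expected tool call is missing
    result.assert_passed()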

Configuration

  • from_attributes = True: Allows creation from object attributes

Model Relationships

The evaluation models work together in the following hierarchy:
EvaluationScore
    ↓ (multiple)
AccuracyEvaluationResult

PerformanceRunResult
    ↓ (multiple)
PerformanceEvaluationResult

ToolCallCheck
    ↓ (multiple)
ReliabilityEvaluationResult

Usage Patterns

Accuracy Evaluation

# Individual evaluation scores are aggregated into AccuracyEvaluationResult
evaluation_scores = [EvaluationScore(...), EvaluationScore(...)]
result = AccuracyEvaluationResult(
    evaluation_scores=evaluation_scores,
    average_score=8.5,
    user_query="What is the capital of France?",
    expected_output="Paris",
    generated_output="The capital of France is Paris."
)

Performance Evaluation

# Individual run results are aggregated into PerformanceEvaluationResult
run_results = [PerformanceRunResult(...), PerformanceRunResult(...)]
result = PerformanceEvaluationResult(
    all_runs=run_results,
    num_iterations=10,
    warmup_runs=2,
    latency_stats={"average": 1.5, "median": 1.4, "min": 1.2, "max": 1.8, "std_dev": 0.2},
    memory_increase_stats={"average": 1024, "median": 1000, "min": 800, "max": 1200, "std_dev": 100},
    memory_peak_stats={"average": 2048, "median": 2000, "min": 1800, "max": 2400, "std_dev": 150}
)

Reliability Evaluation

# Individual tool checks are aggregated into ReliabilityEvaluationResult
checks = [ToolCallCheck(tool_name="search", was_called=True, times_called=2)]
result = ReliabilityEvaluationResult(
    passed=True,
    summary="All reliability checks passed.",
    expected_tool_calls=["search", "analyze"],
    actual_tool_calls=["search", "analyze"],
    checks=checks,
    missing_tool_calls=[],
    unexpected_tool_calls=[]
)

# Integration with testing frameworks
result.assert_passed()  # Raises AssertionError if failed