Evaluation Data Models
This document describes all data models used in the Upsonic evaluation system.
EvaluationScore
Represents the structured judgment from the LLM-as-a-judge for a single evaluation of an agent’s response.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
score | float | Required | A numerical score, on a scale of 1-10, representing the quality and accuracy of the generated response when compared to the expected output and guidelines |
reasoning | str | Required | The detailed, step-by-step inner monologue of the judge, explaining exactly why the given score was assigned. This should reference the query, expected output, and guidelines |
is_met | bool | Required | A definitive boolean flag indicating if the generated output successfully meets the core requirements and spirit of the expected output |
critique | str | Required | Constructive, actionable feedback on how the agent’s response could have been improved. If the response was perfect, this can state that no improvements are needed |
Validation Rules
score
: Must be between 1 and 10 (inclusive)
- All fields are required
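For illustration, the fields and the score bound above map naturally onto a Pydantic model. The sketch below mirrors the documented shape and validation rule; it is not the library's source code.

```python
from pydantic import BaseModel, Field

class EvaluationScore(BaseModel):
    """Illustrative sketch of the judge's structured verdict, using the fields documented above."""
    score: float = Field(..., ge=1, le=10)  # must be between 1 and 10, inclusive
    reasoning: str                          # the judge's step-by-step explanation
    is_met: bool                            # does the output meet the expected output's intent?
    critique: str                           # actionable feedback, or a note that none is needed
```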
PerformanceRunResult
Captures the raw performance metrics from a single execution run of an agent, graph, or team.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
latency_seconds | float | Required | The total wall-clock time taken for the execution, measured in high-precision seconds |
memory_increase_bytes | int | Required | The net increase in memory allocated by Python objects specifically during this run, measured in bytes. This isolates the memory cost of the operation |
memory_peak_bytes | int | Required | The peak memory usage recorded at any point during this specific run, relative to the start of the run, measured in bytes |
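As a rough illustration of what these three metrics correspond to, the snippet below captures comparable numbers for a single callable using only the standard library (`time.perf_counter` and `tracemalloc`); the library's own measurement code may differ.

```python
import time
import tracemalloc

def measure_once(fn) -> dict:
    """Capture latency, net memory increase, and peak memory for one call (illustrative only)."""
    tracemalloc.start()
    start_mem, _ = tracemalloc.get_traced_memory()
    start_time = time.perf_counter()
    fn()
    latency_seconds = time.perf_counter() - start_time
    end_mem, peak_mem = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "latency_seconds": latency_seconds,            # wall-clock time in seconds
        "memory_increase_bytes": end_mem - start_mem,  # net allocation during the run
        "memory_peak_bytes": peak_mem - start_mem,     # peak usage relative to the run's start
    }
```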
ToolCallCheck
Represents the verification result for a single expected tool call.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
tool_name | str | Required | The name of the tool that was being checked for |
was_called | bool | Required | A boolean flag that is True if the tool was found in the execution history, otherwise False |
times_called | int | Required | The total number of times this specific tool was called during the run |
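The check itself is simple to reason about: given the ordered list of tool names from an execution history, each expected tool yields one record. A minimal sketch of that logic, not the library's implementation:

```python
def check_tool(tool_name: str, actual_tool_calls: list[str]) -> dict:
    """Build the fields of a ToolCallCheck from an execution history (illustrative only)."""
    times_called = actual_tool_calls.count(tool_name)
    return {
        "tool_name": tool_name,
        "was_called": times_called > 0,
        "times_called": times_called,
    }

# Example: the agent called "search_web" twice and "summarize" once.
history = ["search_web", "summarize", "search_web"]
print(check_tool("search_web", history))
# {'tool_name': 'search_web', 'was_called': True, 'times_called': 2}
```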
AccuracyEvaluationResult
The final, aggregated result of an accuracy evaluation. This object is returned to the user and contains all inputs, outputs, and the comprehensive judgments from the evaluation process.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
evaluation_scores | List[EvaluationScore] | Required | A list containing the detailed EvaluationScore object from each iteration of the test |
average_score | float | Required | The calculated average score from all evaluation iterations |
user_query | str | Required | The original input query that was provided to the agent under test |
expected_output | str | Required | The ‘gold-standard’ or ground-truth answer that was used as a benchmark for the evaluation |
generated_output | str | Required | The final output that was actually produced by the agent, graph, or team under test |
Configuration
from_attributes = True
: Allows creation from object attributes
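A typical way to consume such a result is to report the average and then drill into the per-iteration judgments. The snippet below assumes `result` is an AccuracyEvaluationResult returned by an accuracy evaluation run; the variable name is illustrative.

```python
# `result` is assumed to be an AccuracyEvaluationResult (illustrative usage).
print(f"Query: {result.user_query}")
print(f"Average score: {result.average_score:.1f} / 10")

for i, judgment in enumerate(result.evaluation_scores, start=1):
    status = "met" if judgment.is_met else "not met"
    print(f"Iteration {i}: {judgment.score}/10 ({status})")
    print(f"  Critique: {judgment.critique}")
```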
PerformanceEvaluationResult
The final, aggregated report of a performance evaluation. It provides meaningful statistics that reveal the stability and characteristics of an agent’s performance.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
all_runs | List[PerformanceRunResult] | Required | A list containing the raw PerformanceRunResult object from every measured iteration |
num_iterations | int | Required | The number of measurement runs that were performed |
warmup_runs | int | Required | The number of warmup runs that were performed before measurements began |
latency_stats | Dict[str, float] | Required | A dictionary of key statistical measures for latency (in seconds), including ‘average’, ‘median’, ‘min’, ‘max’, and ‘std_dev’ |
memory_increase_stats | Dict[str, float] | Required | A dictionary of statistical measures for the net memory increase (in bytes), including ‘average’, ‘median’, ‘min’, ‘max’, and ‘std_dev’ |
memory_peak_stats | Dict[str, float] | Required | A dictionary of statistical measures for the peak memory usage (in bytes), including ‘average’, ‘median’, ‘min’, ‘max’, and ‘std_dev’ |
Configuration
from_attributes = True
: Allows creation from object attributes
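The aggregated statistics can be read directly from the dictionaries. The example below assumes `perf` is a PerformanceEvaluationResult and simply formats the documented keys.

```python
# `perf` is assumed to be a PerformanceEvaluationResult (illustrative usage).
lat = perf.latency_stats  # keys: 'average', 'median', 'min', 'max', 'std_dev' (seconds)
print(f"{perf.num_iterations} runs ({perf.warmup_runs} warmup)")
print(f"Latency: avg {lat['average'] * 1000:.1f} ms, "
      f"median {lat['median'] * 1000:.1f} ms, std dev {lat['std_dev'] * 1000:.1f} ms")
print(f"Peak memory (average): {perf.memory_peak_stats['average'] / 1024:.1f} KiB")
```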
ReliabilityEvaluationResult
The final, comprehensive report of a reliability evaluation. It contains the overall pass/fail status, diagnostic tool call lists, and detailed per-tool checks suitable for integration into automated test suites.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
passed | bool | Required | The overall pass/fail status of the entire reliability check |
summary | str | Required | A human-readable summary explaining the final outcome of the evaluation |
expected_tool_calls | List[str] | Required | The original list of tool names that the user expected to be called |
actual_tool_calls | List[str] | Required | The complete, ordered list of tool names that were actually called during the execution |
checks | List[ToolCallCheck] | Required | A detailed list of the check results for each individual expected tool |
missing_tool_calls | List[str] | Required | A convenience list containing the names of expected tools that were not found in the actual tool calls |
unexpected_tool_calls | List[str] | Required | A list of tools that were called but were not in the expected list. This is only populated if the exact_match setting was used |
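When a check fails, the diagnostic lists make it easy to see what went wrong. An illustrative inspection, assuming `result` is a ReliabilityEvaluationResult:

```python
# `result` is assumed to be a ReliabilityEvaluationResult (illustrative usage).
if not result.passed:
    print(result.summary)
    print("Missing tool calls:   ", result.missing_tool_calls)
    print("Unexpected tool calls:", result.unexpected_tool_calls)  # populated only with exact_match
    for check in result.checks:
        print(f"  {check.tool_name}: called={check.was_called}, times={check.times_called}")
```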
Functions
assert_passed
Raises an AssertionError if the evaluation did not pass.
This method allows for seamless integration into testing frameworks like pytest. If the passed attribute is False, an informative error is raised.
Raises:
AssertionError
: If the evaluation did not pass, with a summary of the failure
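In a pytest suite this typically reduces to a single call. The test below is a sketch in which `run_reliability_evaluation()` is a placeholder for however your project produces a ReliabilityEvaluationResult.

```python
# Illustrative pytest integration; `run_reliability_evaluation` is a hypothetical helper.
def test_agent_calls_expected_tools():
    result = run_reliability_evaluation()
    # Raises AssertionError with the failure summary if any expected tool call is missing.
    result.assert_passed()
```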
Configuration
from_attributes = True
: Allows creation from object attributes