Evaluation Data Models

This page documents all data models used in the Upsonic evaluation system.

EvaluationScore

Represents the structured judgment from the LLM-as-a-judge for a single evaluation of an agent’s response.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score | float | Required | A numerical score, on a scale of 1-10, representing the quality and accuracy of the generated response when compared to the expected output and guidelines |
| reasoning | str | Required | The detailed, step-by-step inner monologue of the judge, explaining exactly why the given score was assigned. This should reference the query, expected output, and guidelines |
| is_met | bool | Required | A definitive boolean flag indicating if the generated output successfully meets the core requirements and spirit of the expected output |
| critique | str | Required | Constructive, actionable feedback on how the agent’s response could have been improved. If the response was perfect, this can state that no improvements are needed |

Validation Rules

  • score: Must be between 1 and 10 (inclusive)
  • All fields are required
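
The 1-10 bound maps naturally onto field validation. The sketch below assumes a Pydantic v2 model; EvaluationScoreSketch and its construction are illustrative, not Upsonic’s actual class definition.

from pydantic import BaseModel, Field

class EvaluationScoreSketch(BaseModel):
    # Hypothetical stand-in for EvaluationScore; field names mirror the table above
    score: float = Field(..., ge=1, le=10)  # must be between 1 and 10, inclusive
    reasoning: str
    is_met: bool
    critique: str

# Construction raises a validation error if score falls outside 1-10
judgment = EvaluationScoreSketch(
    score=9.0,
    reasoning="The answer matches the expected output and follows the guidelines.",
    is_met=True,
    critique="No improvements needed.",
)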

PerformanceRunResult

Captures the raw performance metrics from a single execution run of an agent, graph, or team.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| latency_seconds | float | Required | The total wall-clock time taken for the execution, measured in high-precision seconds |
| memory_increase_bytes | int | Required | The net increase in memory allocated by Python objects specifically during this run, measured in bytes. This isolates the memory cost of the operation |
| memory_peak_bytes | int | Required | The peak memory usage recorded at any point during this specific run, relative to the start of the run, measured in bytes |
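
Metrics of this shape can be captured with the standard library. The helper below is a hedged sketch: measure_run and its use of time.perf_counter and tracemalloc are illustrative, not necessarily how Upsonic measures these values internally.

import time
import tracemalloc

def measure_run(fn):
    # Illustrative helper: collects metrics in the same shape as PerformanceRunResult
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    start = time.perf_counter()

    fn()  # the agent, graph, or team execution being measured

    latency_seconds = time.perf_counter() - start
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return PerformanceRunResult(
        latency_seconds=latency_seconds,
        memory_increase_bytes=current - baseline,  # net allocation during the run
        memory_peak_bytes=peak - baseline,         # peak relative to the start of the run
    )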

ToolCallCheck

Represents the verification result for a single expected tool call.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tool_name | str | Required | The name of the tool that was being checked for |
| was_called | bool | Required | A boolean flag that is True if the tool was found in the execution history, otherwise False |
| times_called | int | Required | The total number of times this specific tool was called during the run |
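
For illustration, a single check can be derived from the list of tool names observed in an execution history (the history below is made-up data):

# Hypothetical execution history of tool names, in call order
history = ["search", "search", "summarize"]

check = ToolCallCheck(
    tool_name="search",
    was_called="search" in history,
    times_called=history.count("search"),
)
# check.was_called == True, check.times_called == 2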

AccuracyEvaluationResult

The final, aggregated result of an accuracy evaluation. This object is returned to the user and contains all inputs, outputs, and the comprehensive judgments from the evaluation process.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| evaluation_scores | List[EvaluationScore] | Required | A list containing the detailed EvaluationScore object from each iteration of the test |
| average_score | float | Required | The calculated average score from all evaluation iterations |
| user_query | str | Required | The original input query that was provided to the agent under test |
| expected_output | str | Required | The ‘gold-standard’ or ground-truth answer that was used as a benchmark for the evaluation |
| generated_output | str | Required | The final output that was actually produced by the agent, graph, or team under test |

Configuration

  • from_attributes = True: Allows creation from object attributes
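
Assuming the model is a Pydantic v2 model, from_attributes = True lets it be validated directly from any object exposing matching attributes, not only from a dict. A minimal sketch, where SimpleNamespace stands in for whatever internal result object the evaluator produces:

from types import SimpleNamespace

raw = SimpleNamespace(
    evaluation_scores=[],
    average_score=8.5,
    user_query="What is the capital of France?",
    expected_output="Paris",
    generated_output="The capital of France is Paris.",
)

# model_validate reads the matching attribute names from the object
result = AccuracyEvaluationResult.model_validate(raw)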

PerformanceEvaluationResult

The final, aggregated report of a performance evaluation. It provides meaningful statistics that reveal the stability and characteristics of an agent’s performance.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| all_runs | List[PerformanceRunResult] | Required | A list containing the raw PerformanceRunResult object from every measured iteration |
| num_iterations | int | Required | The number of measurement runs that were performed |
| warmup_runs | int | Required | The number of warmup runs that were performed before measurements began |
| latency_stats | Dict[str, float] | Required | A dictionary of key statistical measures for latency (in seconds), including ‘average’, ‘median’, ‘min’, ‘max’, and ‘std_dev’ |
| memory_increase_stats | Dict[str, float] | Required | A dictionary of statistical measures for the net memory increase (in bytes), including ‘average’, ‘median’, ‘min’, ‘max’, and ‘std_dev’ |
| memory_peak_stats | Dict[str, float] | Required | A dictionary of statistical measures for the peak memory usage (in bytes), including ‘average’, ‘median’, ‘min’, ‘max’, and ‘std_dev’ |
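
For illustration, the three *_stats dictionaries can be produced from the raw runs with the standard statistics module (summarize and run_results are hypothetical names, not part of the library):

import statistics

def summarize(values):
    # Produces the same keys the *_stats dictionaries use
    return {
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "std_dev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

latency_stats = summarize([run.latency_seconds for run in run_results])
memory_increase_stats = summarize([run.memory_increase_bytes for run in run_results])
memory_peak_stats = summarize([run.memory_peak_bytes for run in run_results])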

Configuration

  • from_attributes = True: Allows creation from object attributes

ReliabilityEvaluationResult

The final, comprehensive report of a reliability evaluation. It contains the overall pass/fail status, diagnostic tool call lists, and detailed per-tool checks, and is designed for integration into automated test suites.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| passed | bool | Required | The overall pass/fail status of the entire reliability check |
| summary | str | Required | A human-readable summary explaining the final outcome of the evaluation |
| expected_tool_calls | List[str] | Required | The original list of tool names that the user expected to be called |
| actual_tool_calls | List[str] | Required | The complete, ordered list of tool names that were actually called during the execution |
| checks | List[ToolCallCheck] | Required | A detailed list of the check results for each individual expected tool |
| missing_tool_calls | List[str] | Required | A convenience list containing the names of expected tools that were not found in the actual tool calls |
| unexpected_tool_calls | List[str] | Required | A list of tools that were called but were not in the expected list. This is only populated if the exact_match setting was used |
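
For illustration, the two convenience lists can be derived from the expected and actual call lists roughly as follows (a sketch, not necessarily the library’s exact logic):

expected_tool_calls = ["search", "analyze"]
actual_tool_calls = ["search", "summarize"]

missing_tool_calls = [name for name in expected_tool_calls if name not in actual_tool_calls]
unexpected_tool_calls = [name for name in actual_tool_calls if name not in expected_tool_calls]
# missing_tool_calls == ["analyze"], unexpected_tool_calls == ["summarize"]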

Functions

assert_passed

Raises an AssertionError if the evaluation did not pass. This method allows seamless integration into testing frameworks such as pytest: if the passed attribute is False, an informative error is raised.

Raises:
  • AssertionError: If the evaluation did not pass, with a summary of the failure
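
In a pytest suite this typically looks like the sketch below, where run_reliability_eval is a hypothetical placeholder for however you obtain a ReliabilityEvaluationResult:

def test_agent_calls_required_tools():
    result = run_reliability_eval()  # hypothetical helper returning a ReliabilityEvaluationResult
    # Fails the test with the evaluation summary if any expected tool call is missing
    result.assert_passed()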

Configuration

  • from_attributes = True: Allows creation from object attributes

Model Relationships

The evaluation models work together in the following hierarchy:
EvaluationScore
    ↓ (multiple)
AccuracyEvaluationResult

PerformanceRunResult
    ↓ (multiple)
PerformanceEvaluationResult

ToolCallCheck
    ↓ (multiple)
ReliabilityEvaluationResult

Usage Patterns

Accuracy Evaluation

# Individual evaluation scores are aggregated into AccuracyEvaluationResult
evaluation_scores = [EvaluationScore(...), EvaluationScore(...)]
result = AccuracyEvaluationResult(
    evaluation_scores=evaluation_scores,
    average_score=8.5,
    user_query="What is the capital of France?",
    expected_output="Paris",
    generated_output="The capital of France is Paris."
)

Performance Evaluation

# Individual run results are aggregated into PerformanceEvaluationResult
run_results = [PerformanceRunResult(...), PerformanceRunResult(...)]
result = PerformanceEvaluationResult(
    all_runs=run_results,
    num_iterations=10,
    warmup_runs=2,
    latency_stats={"average": 1.5, "median": 1.4, "min": 1.2, "max": 1.8, "std_dev": 0.2},
    memory_increase_stats={"average": 1024, "median": 1000, "min": 800, "max": 1200, "std_dev": 100},
    memory_peak_stats={"average": 2048, "median": 2000, "min": 1800, "max": 2400, "std_dev": 150}
)

Reliability Evaluation

# Individual tool checks are aggregated into ReliabilityEvaluationResult
checks = [ToolCallCheck(tool_name="search", was_called=True, times_called=2)]
result = ReliabilityEvaluationResult(
    passed=True,
    summary="All reliability checks passed.",
    expected_tool_calls=["search", "analyze"],
    actual_tool_calls=["search", "analyze"],
    checks=checks,
    missing_tool_calls=[],
    unexpected_tool_calls=[]
)

# Integration with testing frameworks
result.assert_passed()  # Raises AssertionError if failed