Upsonic provides a built-in evaluation framework to systematically test and benchmark your AI agents, teams, and graphs. Evaluations help you ensure that your AI workflows meet quality, performance, and reliability standards before deploying to production.

Evaluation Types

Accuracy

LLM-as-a-judge evaluation that scores agent output quality against expected answers on a 1–10 scale.
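The scoring logic can be pictured with a small sketch. Note this is an illustration of the LLM-as-a-judge aggregation idea, not Upsonic's actual implementation: the `aggregate_scores` function and the 7.0 pass threshold are assumptions for the example.

```python
from statistics import mean

def aggregate_scores(scores: list[int], threshold: float = 7.0) -> tuple[float, bool]:
    """Average per-iteration judge scores (1-10) and decide pass/fail.

    Illustrative only: a real accuracy evaluator would obtain each score
    by prompting a judge model with the query, the expected answer, and
    the agent's actual output.
    """
    avg = mean(scores)
    return avg, avg >= threshold

# Three judge iterations scoring the same agent output
avg, passed = aggregate_scores([8, 9, 7])
```

The threshold and averaging strategy are configuration details; the key idea is that a second model scores the output rather than a string comparison.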

Performance

Latency and memory profiling with statistical analysis across multiple iterations.
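The latency side of this can be sketched in a few lines of standard-library Python. This is a generic profiling pattern, not Upsonic's internal implementation; the `profile_latency` helper is a hypothetical name for the example.

```python
import time
from statistics import mean, stdev

def profile_latency(fn, iterations: int = 5) -> dict:
    """Time fn over several iterations and summarize the samples."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()  # the call under test, e.g. an agent invocation
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": mean(samples),
        "stdev": stdev(samples) if len(samples) > 1 else 0.0,
    }

# Profile a cheap stand-in workload across 5 iterations
stats = profile_latency(lambda: sum(range(10_000)), iterations=5)
```

Running multiple iterations matters because single-shot timings are noisy; the spread (stdev) is often as informative as the mean.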

Reliability

Tool-call verification that asserts expected tools were invoked during execution.
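The core assertion here is simple set membership over the recorded call trace. The sketch below shows the idea with a hypothetical `verify_tool_calls` helper; Upsonic's reliability evaluator wires this into actual agent execution.

```python
def verify_tool_calls(called: list[str], expected: list[str]) -> dict:
    """Check that every expected tool name appears in the recorded calls."""
    missing = [name for name in expected if name not in called]
    return {"passed": not missing, "missing": missing}

# The agent called search_web and summarize; we only required search_web
result = verify_tool_calls(
    called=["search_web", "summarize"],
    expected=["search_web"],
)
```

A failing check surfaces exactly which expected tools were never invoked, which is usually enough to diagnose a broken tool-routing prompt.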

Quick Start

Install the required dependencies and run your first evaluation in minutes.
import asyncio
from upsonic import Agent
from upsonic.eval import AccuracyEvaluator

# Agent whose output will be evaluated
agent = Agent(
    model="anthropic/claude-sonnet-4-5",
    name="Assistant",
)

# Agent acting as the LLM judge
judge = Agent(
    model="anthropic/claude-sonnet-4-5",
    name="Judge",
)

evaluator = AccuracyEvaluator(
    judge_agent=judge,
    agent_under_test=agent,
    query="What is the capital of France?",
    expected_output="Paris is the capital of France.",
    additional_guidelines="Check if the answer correctly identifies Paris.",
    num_iterations=1,  # number of evaluation runs to average
)

# Evaluators expose an async run() method
result = asyncio.run(evaluator.run())

print(f"Score: {result.average_score}/10")
print(f"Passed: {result.evaluation_scores[0].is_met}")

Supported Entities

Every evaluator works with all three core entities:
Entity | Description
Agent | Single agent executing a task
Team | Multi-agent team in sequential, coordinate, or route mode
Graph | DAG-based workflow with chained task nodes