Upsonic provides a built-in evaluation framework to systematically test and benchmark your AI agents, teams, and graphs. Evaluations help you ensure that your AI workflows meet quality, performance, and reliability standards before deploying to production.

Evaluation Types

Accuracy

LLM-as-a-judge evaluation that scores agent output quality against expected answers on a 1–10 scale.
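The scoring logic can be pictured with a small sketch. Note this is an illustration of the LLM-as-a-judge aggregation idea, not Upsonic's actual implementation: the `aggregate_scores` function and the 7.0 pass threshold are assumptions for the example.

```python
from statistics import mean

def aggregate_scores(scores: list[int], threshold: float = 7.0) -> tuple[float, bool]:
    """Average per-iteration judge scores (1-10) and decide pass/fail.

    Illustrative only: a real accuracy evaluator would obtain each score
    by prompting a judge model with the query, the expected answer, and
    the agent's actual output.
    """
    avg = mean(scores)
    return avg, avg >= threshold

# Three judge iterations scoring the same agent output
avg, passed = aggregate_scores([8, 9, 7])
```

The threshold and averaging strategy are configuration details; the key idea is that a second model scores the output rather than a string comparison.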

Performance

Latency and memory profiling with statistical analysis across multiple iterations.
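The latency side of this can be sketched in a few lines of standard-library Python. This is a generic profiling pattern, not Upsonic's internal implementation; the `profile_latency` helper is a hypothetical name for the example.

```python
import time
from statistics import mean, stdev

def profile_latency(fn, iterations: int = 5) -> dict:
    """Time fn over several iterations and summarize the samples."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()  # the call under test, e.g. an agent invocation
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": mean(samples),
        "stdev": stdev(samples) if len(samples) > 1 else 0.0,
    }

# Profile a cheap stand-in workload across 5 iterations
stats = profile_latency(lambda: sum(range(10_000)), iterations=5)
```

Running multiple iterations matters because single-shot timings are noisy; the spread (stdev) is often as informative as the mean.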

Reliability

Tool-call verification that asserts expected tools were invoked during execution.
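The core assertion here is simple set membership over the recorded call trace. The sketch below shows the idea with a hypothetical `verify_tool_calls` helper; Upsonic's reliability evaluator wires this into actual agent execution.

```python
def verify_tool_calls(called: list[str], expected: list[str]) -> dict:
    """Check that every expected tool name appears in the recorded calls."""
    missing = [name for name in expected if name not in called]
    return {"passed": not missing, "missing": missing}

# The agent called search_web and summarize; we only required search_web
result = verify_tool_calls(
    called=["search_web", "summarize"],
    expected=["search_web"],
)
```

A failing check surfaces exactly which expected tools were never invoked, which is usually enough to diagnose a broken tool-routing prompt.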

Quick Start

Install the required dependencies and run your first evaluation in minutes.
import asyncio
from upsonic import Agent
from upsonic.eval import AccuracyEvaluator

# Agent whose output will be evaluated
agent = Agent(
    model="anthropic/claude-sonnet-4-5",
    name="Assistant",
)

# Agent acting as the LLM judge
judge = Agent(
    model="anthropic/claude-sonnet-4-5",
    name="Judge",
)

evaluator = AccuracyEvaluator(
    judge_agent=judge,
    agent_under_test=agent,
    query="What is the capital of France?",
    expected_output="Paris is the capital of France.",
    additional_guidelines="Check if the answer correctly identifies Paris.",
    num_iterations=1,  # number of evaluation runs to average
)

# Evaluators expose an async run() method
result = asyncio.run(evaluator.run())

print(f"Score: {result.average_score}/10")
print(f"Passed: {result.evaluation_scores[0].is_met}")

Supported Entities

Every evaluator works with all three core entities:
Entity | Description
Agent | Single agent executing a task
Team | Multi-agent team in sequential, coordinate, or route mode
Graph | DAG-based workflow with chained task nodes