
3 posts tagged with "v0.62.0"


Evaluation SDK

The Evaluation SDK lets you run evaluations programmatically. It gives you full control over test data and evaluation logic, works with agents built on any framework, and sends results to the Agenta dashboard.

Why Programmatic Evaluation?

Complex AI agents need evaluation that goes beyond UI-based testing. The Evaluation SDK provides code-level control over test data and evaluation logic, so you can test agents built with any framework, run evaluations in your CI/CD pipeline, and debug complex workflows with full trace visibility.

Key Capabilities

Test Data Management

Create test sets directly in your code or fetch existing ones from Agenta. Test sets can include ground truth data for reference-based evaluation or work without it for evaluators that only need the output.
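
As an illustration, the same ag.testsets.acreate call used in the Getting Started example below can build either kind of test set; a reference-based set carries an expected answer per row, while a reference-free set carries only the inputs. A minimal sketch (the column names are illustrative, and, as in the example below, the calls are assumed to run in an async context):

import agenta as ag

ag.init()

# Test set with ground truth, for reference-based evaluators
with_references = await ag.testsets.acreate(
    name="QA with references",
    data=[
        {"question": "What is 2+2?", "expected": "4"},
    ],
)

# Reference-free test set, for evaluators that only need the output
outputs_only = await ag.testsets.acreate(
    name="QA prompts only",
    data=[
        {"question": "Summarize the refund policy."},
    ],
)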

Built-in Evaluators

The SDK includes LLM-as-a-Judge, semantic similarity, and regex matching evaluators. You can also write custom Python evaluators for your specific requirements.
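
A custom evaluator follows the same shape as the one in the Getting Started example below: a function decorated with @ag.evaluator that returns a score. Here is a hedged sketch of a reference-free check; the keyword logic is purely illustrative, and it assumes the SDK passes an evaluator only the parameters it declares, in line with the reference-free evaluators mentioned above:

import re

import agenta as ag

# Custom evaluator: passes when the output contains a well-formed email address
@ag.evaluator(slug="contains_email")
async def contains_email(outputs: str):
    found = bool(re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", outputs))
    return {
        "score": 1.0 if found else 0.0,
        "success": found,
    }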

Reusable Configurations

Save evaluator configurations in Agenta to reuse them across runs. Configure an evaluator once, then reference it in multiple evaluations.

Span-Level Evaluation

Evaluate your agent end to end, or target specific spans in the execution trace to test individual components such as retrieval steps or tool calls separately.

Run on Your Infrastructure

Evaluations run on your infrastructure. Results appear in the Agenta dashboard with full traces and comparison views.

Getting Started

Install the SDK:

pip install agenta

Here's a minimal example evaluating a simple agent:

import agenta as ag
from agenta.sdk.evaluations import aevaluate

# Initialize
ag.init()

# Define your application
@ag.application(slug="my_agent")
async def my_agent(question: str):
    # Your agent logic here
    return answer

# Define an evaluator
@ag.evaluator(slug="correctness_check")
async def correctness_check(expected: str, outputs: str):
    return {
        "score": 1.0 if outputs == expected else 0.0,
        "success": outputs == expected,
    }

# Create test data
testset = await ag.testsets.acreate(
    name="Agent Tests",
    data=[
        {"question": "What is 2+2?", "expected": "4"},
        {"question": "What is the capital of France?", "expected": "Paris"},
    ],
)

# Run evaluation
result = await aevaluate(
    name="Agent Correctness Test",
    testsets=[testset.id],
    applications=[my_agent],
    evaluators=[correctness_check],
)

print(f"View results: {result['dashboard_url']}")

Dashboard Integration

Every evaluation run gets a shareable dashboard link. The dashboard shows full execution traces, comparison views for different versions, aggregated metrics, and individual test case details.

Next Steps

Check out the Quick Start Guide to build your first evaluation.

Online Evaluation

Online Evaluation automatically evaluates every request to your LLM application in production. Catch quality issues like hallucinations and off-brand responses as they happen.

How It Works

Online Evaluation runs evaluators on your production traces automatically. Monitor quality in real time instead of discovering issues through user complaints.

Key Features

Automatic Evaluation

Every request to your application gets evaluated automatically. The system runs your configured evaluators on each trace as it arrives.

Evaluator Configuration

Configure evaluators like LLM-as-a-Judge with custom prompts tailored to your quality criteria. Use any evaluator that works in regular evaluations.

Span-Level Evaluation

Create online evaluations with filters for specific spans in your traces. Evaluate just the retrieval step in your RAG pipeline or focus on specific tool calls in your agent.

Sampling Control

Set sampling rates to control costs. Evaluate every request during testing, then sample a percentage in production to balance quality monitoring with budget.

Filtering and Analysis

View all evaluated requests in one place. Filter traces by evaluation scores to find problematic cases. Jump into detailed traces to understand what went wrong.

Build Better Test Sets

Add problematic cases directly to your test sets. Turn production failures into regression tests.

Setup

Setting up online evaluation takes a few minutes:

  1. Navigate to the Online Evaluation section
  2. Select the evaluators you want to run
  3. Configure sampling rates and span filters if needed
  4. Enable the online evaluation

Your application traces will be automatically evaluated as they arrive.

Use Cases

Catch hallucinations by running fact-checking evaluators on every response. Monitor brand compliance using LLM-as-a-Judge evaluators with custom prompts. Track RAG quality by evaluating retrieval in real time. Monitor agent reliability by checking tool calls and reasoning steps. Build better test sets by capturing edge cases from production.

Next Steps

Learn about configuring evaluators for your quality criteria.

Customize LLM-as-a-Judge Output Schemas

The LLM-as-a-Judge evaluator now supports custom output schemas. You can define exactly what feedback structure you need for your evaluations.

What's New

Flexible Output Types

Configure the evaluator to return different types of outputs:

  • Binary: Return a simple yes/no or pass/fail score
  • Multiclass: Choose from multiple predefined categories
  • Custom JSON: Define any structure that fits your use case

Include Reasoning for Better Quality

Enable the reasoning option to have the LLM explain its evaluation. This improves prediction quality because the model thinks through its assessment before providing a score.

When you include reasoning, the evaluator returns both the score and a detailed explanation of how it arrived at that judgment.

Advanced: Raw JSON Schema

For complete control, provide a raw JSON schema. The evaluator will return responses that match your exact structure.

This lets you capture multiple scores, categorical labels, confidence levels, and custom fields in a single evaluation pass. You can structure the output however your workflow requires.
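
For example, a schema that captures two numeric scores, a categorical label, and a short explanation might look like the sketch below. The field names are illustrative, not a prescribed format; it is written as a Python dict for readability, and the equivalent JSON is what you would supply in the evaluator configuration:

# Illustrative JSON Schema for multi-dimensional judge feedback (field names are assumptions)
feedback_schema = {
    "type": "object",
    "properties": {
        "accuracy": {"type": "number", "minimum": 0, "maximum": 1},
        "relevance": {"type": "number", "minimum": 0, "maximum": 1},
        "label": {"type": "string", "enum": ["excellent", "good", "fair", "poor"]},
        "reasoning": {"type": "string"},
    },
    "required": ["accuracy", "relevance", "label"],
}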

Use Custom Schemas in Evaluation

Once configured, your custom schemas work seamlessly in the evaluation workflow. The results display in the evaluation dashboard with all your custom fields visible.

This makes it easy to analyze multiple dimensions of quality in a single evaluation run.

Example Use Cases

Binary Score with Reasoning: Return a simple correct/incorrect judgment along with an explanation of why the output succeeded or failed.

Multi-dimensional Feedback: Capture separate scores for accuracy, relevance, completeness, and tone in one evaluation. Include reasoning for each dimension.

Structured Classification: Return categorical labels (excellent/good/fair/poor) along with specific issues found and suggestions for improvement.

Getting Started

To use custom output schemas with LLM-as-a-Judge:

  1. Open the evaluator configuration
  2. Select your desired output type (binary, multiclass, or custom)
  3. Enable reasoning if you want explanations
  4. For advanced use, provide your JSON schema
  5. Run your evaluation

Learn more in the LLM-as-a-Judge documentation.