Blog - Docs - Agenta

Online Evaluation

November 11, 2025

Online Evaluation automatically evaluates every request to your LLM application in production. Catch quality issues like hallucinations and off-brand responses as they happen.

How It Works

Online Evaluation runs evaluators on your production traces automatically. Monitor quality in real time instead of discovering issues through user complaints.

Key Features

Automatic Evaluation

Every request to your application gets evaluated automatically. The system runs your configured evaluators on each trace as it arrives.

Evaluator Configuration

Configure evaluators like LLM-as-a-Judge with custom prompts tailored to your quality criteria. Use any evaluator that works in regular evaluations.

Span-Level Evaluation

Create online evaluations with filters for specific spans in your traces. Evaluate just the retrieval step in your RAG pipeline or focus on specific tool calls in your agent.
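
As a rough illustration of what a span filter does, the sketch below picks out only the retrieval spans from a trace. The trace shape (a `spans` list whose items carry `name` and `type`) and the `select_spans` helper are assumptions made for this example, not Agenta's actual trace schema or API.

```python
from typing import Any

# Hypothetical trace shape: a dict with a list of spans, each with a
# "name" and a "type". This is an assumption, not Agenta's schema.
Trace = dict[str, Any]


def select_spans(trace: Trace, span_type: str) -> list[dict[str, Any]]:
    """Return only the spans of the given type, in their original order."""
    return [span for span in trace.get("spans", []) if span.get("type") == span_type]


trace = {
    "id": "trace-1",
    "spans": [
        {"name": "retrieve_docs", "type": "retrieval", "output": ["doc-a", "doc-b"]},
        {"name": "generate_answer", "type": "llm", "output": "The answer is 42."},
        {"name": "search_web", "type": "tool", "output": "no results"},
    ],
}

# Only the retrieval step would be handed to the evaluator.
retrieval_spans = select_spans(trace, "retrieval")
print([span["name"] for span in retrieval_spans])  # ['retrieve_docs']
```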

Sampling Control

Set sampling rates to control costs. Evaluate every request during testing, then sample a percentage in production to balance quality monitoring with budget.
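
One common way to implement a sampling rate is to hash the trace ID into a stable bucket and evaluate only traces whose bucket falls below the rate. The sketch below shows that idea. The `should_evaluate` function is hypothetical, and how Agenta actually selects traces is not described here.

```python
import hashlib


def should_evaluate(trace_id: str, sampling_rate: float) -> bool:
    """Decide deterministically whether a trace falls inside the sample.

    sampling_rate is a fraction: 0.0 evaluates nothing, 1.0 evaluates everything.
    """
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")
    # Hash the trace ID into a stable bucket in the range [0, 1).
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sampling_rate


# Evaluate everything while testing, then a fraction in production.
trace_ids = [f"trace-{i}" for i in range(1000)]
print(sum(should_evaluate(t, 1.0) for t in trace_ids))  # 1000
print(sum(should_evaluate(t, 0.1) for t in trace_ids))  # roughly a tenth of the traces
```

Hashing rather than calling a random number generator means the same trace always gets the same decision, which keeps results reproducible when a trace is processed more than once.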

Filtering and Analysis

View all evaluated requests in one place. Filter traces by evaluation scores to find problematic cases. Jump into detailed traces to understand what went wrong.

Build Better Test Sets

Add problematic cases directly to your test sets. Turn production failures into regression tests.

Setup

Setting up online evaluation takes a few minutes:

Navigate to the Online Evaluation section
Select the evaluators you want to run
Configure sampling rates and span filters if needed
Enable the online evaluation

Your application traces will be automatically evaluated as they arrive.

Use Cases

Catch hallucinations by running fact-checking evaluators on every response. Monitor brand compliance using LLM-as-a-Judge evaluators with custom prompts. Track RAG quality by evaluating retrieval in real time. Monitor agent reliability by checking tool calls and reasoning steps. Build better test sets by capturing edge cases from production.

Next Steps

Learn about configuring evaluators for your quality criteria.

Customize LLM-as-a-Judge Output Schemas

November 10, 2025

The LLM-as-a-Judge evaluator now supports custom output schemas. You can define exactly what feedback structure you need for your evaluations.

What's New

Flexible Output Types

Configure the evaluator to return different types of outputs:

Binary: Return a simple yes/no or pass/fail score
Multiclass: Choose from multiple predefined categories
Custom JSON: Define any structure that fits your use case

Include Reasoning for Better Quality

Enable the reasoning option to have the LLM explain its evaluation. This improves prediction quality because the model thinks through its assessment before providing a score.

When you include reasoning, the evaluator returns both the score and a detailed explanation of how it arrived at that judgment.

Advanced: Raw JSON Schema

For complete control, provide a raw JSON schema. The evaluator will return responses that match your exact structure.

This lets you capture multiple scores, categorical labels, confidence levels, and custom fields in a single evaluation pass. You can structure the output however your workflow requires.
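
To make this concrete, here is a minimal sketch of what such a schema could look like, written as a Python dictionary using standard JSON Schema keywords. The field names (`accuracy`, `tone`, `confidence`, `reasoning`) and the `missing_required_fields` helper are illustrative assumptions, not fields or functions that Agenta defines.

```python
import json

# Hypothetical schema: the field names are illustrative only.
judge_output_schema = {
    "type": "object",
    "properties": {
        "accuracy": {"type": "number", "minimum": 0, "maximum": 1},
        "tone": {"type": "string", "enum": ["on-brand", "off-brand"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "reasoning": {"type": "string"},
    },
    "required": ["accuracy", "tone", "reasoning"],
    "additionalProperties": False,
}


def missing_required_fields(schema: dict, response_text: str) -> list[str]:
    """Parse a judge response and list the required fields it does not contain."""
    response = json.loads(response_text)
    return [field for field in schema["required"] if field not in response]


sample = '{"accuracy": 0.9, "tone": "on-brand", "reasoning": "Matches the source."}'
print(missing_required_fields(judge_output_schema, sample))  # []
```

The helper only checks for required keys. A full validator, such as a JSON Schema library, would also check types, ranges, and enum values.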

Use Custom Schemas in Evaluation

Once configured, your custom schemas work seamlessly in the evaluation workflow. The results display in the evaluation dashboard with all your custom fields visible.

This makes it easy to analyze multiple dimensions of quality in a single evaluation run.

Example Use Cases

Binary Score with Reasoning: Return a simple correct/incorrect judgment along with an explanation of why the output succeeded or failed.

Multi-dimensional Feedback: Capture separate scores for accuracy, relevance, completeness, and tone in one evaluation. Include reasoning for each dimension.

Structured Classification: Return categorical labels (excellent/good/fair/poor) along with specific issues found and suggestions for improvement.

Getting Started

To use custom output schemas with LLM-as-a-Judge:

Open the evaluator configuration
Select your desired output type (binary, multiclass, or custom)
Enable reasoning if you want explanations
For advanced use, provide your JSON schema
Run your evaluation

Learn more in the LLM-as-a-Judge documentation.

Documentation Architecture Overhaul

November 3, 2025

We've completely rewritten and restructured our documentation with a new architecture. This is one of the largest documentation updates we've made: a near-complete rewrite of existing content plus substantial new material.

Diataxis Framework Implementation

We've reorganized all documentation using the Diataxis framework, which separates content into tutorials, how-to guides, reference, and explanation.

Expanded Observability Documentation

One of the biggest gaps in our previous documentation was observability. We've added comprehensive documentation to close it.

JavaScript/TypeScript Support

Documentation now includes JavaScript and TypeScript examples alongside Python wherever applicable. This makes it easier for JavaScript developers to integrate Agenta into their applications.

Ask AI Feature

We've added a new "Ask AI" feature that lets you ask the documentation questions directly. Get instant answers without searching through pages.

Vertex AI Provider Support

October 24, 2025

We've added support for Google Cloud's Vertex AI platform. You can now use Gemini models and other Vertex AI partner models directly in Agenta.

What's New

Vertex AI is now available as a provider across the platform:

Playground: Configure and test Gemini models and other Vertex AI models
Model Hub: Add your Vertex AI credentials and manage available models
Gateway: Access Vertex AI models through the invoke endpoints

You can use any model available through Vertex AI, including:

Gemini models: Google's most capable AI models (gemini-2.5-pro, gemini-2.5-flash, etc.)
Partner models: Claude, Llama, Mistral, and other models available through Vertex AI Model Garden

Configuration

To get started with Vertex AI, go to Settings → Model Hub and add your Vertex AI credentials:

Vertex Project: Your Google Cloud project ID
Vertex Location: The region for your models (e.g., us-central1, europe-west4)
Vertex Credentials: Your service account key in JSON format
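
Before pasting these values into the Model Hub, a quick local sanity check can catch obvious mistakes. The sketch below is one way to do that. The `check_vertex_settings` helper is hypothetical and not part of Agenta. The only outside knowledge it relies on is that Google Cloud service account key files normally contain `type`, `project_id`, `private_key`, and `client_email`.

```python
import json

# Fields that a Google Cloud service account key file normally contains.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_vertex_settings(project: str, location: str, credentials_json: str) -> list[str]:
    """Return a list of problems; an empty list means the values look plausible."""
    problems = []
    if not project.strip():
        problems.append("Vertex Project is empty")
    if not location.strip():
        problems.append("Vertex Location is empty")
    try:
        key = json.loads(credentials_json)
    except json.JSONDecodeError:
        return problems + ["Vertex Credentials is not valid JSON"]
    if not isinstance(key, dict):
        return problems + ["Vertex Credentials is not a JSON object"]
    for field in REQUIRED_KEY_FIELDS:
        if field not in key:
            problems.append(f"Vertex Credentials is missing '{field}'")
    if "type" in key and key["type"] != "service_account":
        problems.append("Vertex Credentials is not a service account key")
    return problems


# Placeholder values only; never commit a real key.
fake_key = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "<redacted>",
    "client_email": "agenta@my-project.iam.gserviceaccount.com",
})
print(check_vertex_settings("my-project", "us-central1", fake_key))  # []
```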

For detailed setup instructions, see our documentation on adding custom providers.

Security

All API keys and credentials are encrypted both in transit and at rest, ensuring your sensitive information stays secure.

Filtering Traces by Annotation

October 14, 2025

We rebuilt the filtering system in observability. The new filter dropdown is simpler and offers more options, and you can now filter and search traces by their annotations. This helps you quickly find traces with low scores or bad feedback.

New Filter Options

You can now filter by:

Span status: Find successful or failed spans
Input keys: Search for specific inputs in your spans
App or environment: Filter traces from specific apps or environments
Any key within your span: Search custom data in your trace structure

Annotation Filtering

Filter traces based on evaluations and feedback:

Evaluator results: Find spans evaluated by a specific evaluator
User feedback: Search for spans with feedback like success=True

This feature enables powerful workflows:

Capture user feedback from your application using our API (see tutorial)
Filter traces to find those with bad feedback or low scores
Add them to test sets to track problematic cases
Improve your prompts based on real user feedback
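
The workflow above can be sketched in a few lines: select the traces whose annotations indicate a problem, then reshape them into test cases. The record layout (`inputs`, `output`, `annotations`) and both helper functions are assumptions made for illustration. They do not reflect Agenta's API or export format.

```python
from typing import Any

# Hypothetical trace records with annotations attached (illustrative data).
traces: list[dict[str, Any]] = [
    {
        "inputs": {"question": "What is the refund window?"},
        "output": "30 days.",
        "annotations": {"success": True, "score": 0.9},
    },
    {
        "inputs": {"question": "Do you ship to Canada?"},
        "output": "We only sell software.",
        "annotations": {"success": False, "score": 0.2},
    },
    {
        "inputs": {"question": "How do I reset my password?"},
        "output": "Contact your bank.",
        "annotations": {"score": 0.4},
    },
]


def needs_review(trace: dict[str, Any], min_score: float = 0.5) -> bool:
    """True when feedback is explicitly negative or the score is below min_score."""
    annotations = trace.get("annotations", {})
    return annotations.get("success") is False or annotations.get("score", 1.0) < min_score


def to_test_case(trace: dict[str, Any]) -> dict[str, Any]:
    """Flatten a trace into a test set row, keeping the observed output for reference."""
    return {**trace["inputs"], "observed_output": trace["output"]}


test_set = [to_test_case(t) for t in traces if needs_review(t)]
print(len(test_set))  # 2
```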

The filtering system makes it easy to turn production issues into test cases.

New Evaluation Results Dashboard

September 26, 2025

We rebuilt the evaluation results dashboard. Now you can check your results faster and see how well your AI performs.

What's New

Charts and Graphs

We added charts that show your AI's performance. You can quickly spot problems and see patterns in your data.

Compare Results Side by Side

Compare multiple tests at once. See which prompts or models work better. View charts and detailed results together.

Better Results Table

Results now show in a clean table format. It works great for small tests (10 cases) and big tests (10,000+ cases). The page loads fast no matter how much data you have.

Detailed View

Click on any result to see more details. Find out why a test passed or failed. Get the full picture of what happened.

See Your Settings

Check exactly which settings you used for each test. This helps you repeat successful tests and understand your results better.

Name Your Tests

Give your tests names and descriptions. Stay organized and help your team understand what each test does.

Deep URL Support for Shareable Links

September 24, 2025

URLs across Agenta now include workspace context, making them fully shareable between team members. This was a highly requested feature that addresses several critical issues with the previous URL structure.

What Changed

Before

URLs did not include workspace information
Sharing links between team members would redirect to the recipient's default workspace
Page refreshes would sometimes lose context and revert to the default workspace
Deep linking to specific resources was unreliable

Now

All URLs include the workspace context in the URL path
Links shared between team members work correctly, maintaining the intended workspace
Page refreshes maintain the correct workspace context
Deep linking works reliably for all resources

What You Can Deep Link

You can now create shareable deep links to almost any resource in Agenta:

Prompts: Share direct links to specific prompts in any workspace
Evaluations: Link directly to evaluation results and configurations
Test Sets: Share test sets with team members
Playground Sessions: Link to specific playground configurations

Speed Improvements in the Playground

September 19, 2025

We rewrote most of Agenta's frontend. You'll see much faster speeds when you create prompts or use the playground.

We also made many improvements and fixed bugs:

LLM-as-a-judge prompts now use double curly braces ({{variable}}) for variables instead of single curly braces ({variable}). This matches how normal prompts work. Old LLM-as-a-judge prompts with single curly braces still work. We also updated the LLM-as-a-judge playground to make editing prompts easier.
You can now use an external Redis instance for caching by setting it as an environment variable
Fixed the custom workflow quick start tutorial and examples
Fixed SDK compatibility issues with Python 3.9
Fixed default filtering in observability dashboard
Fixed error handling in the evaluator playground
Fixed the Tracing SDK to allow instrumenting streaming responses and overriding OTEL environment variables
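
To illustrate the double-curly-brace change in the list above, the sketch below substitutes `{{name}}` placeholders with a regular expression. It is a stand-in for the idea, not Agenta's template engine, and the `render` function is hypothetical. One practical benefit of double braces is that literal single braces, such as a JSON example inside a judge prompt, pass through untouched.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
_VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: dict[str, str]) -> str:
    """Replace each {{name}} placeholder; leave unknown placeholders as they are."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(variables.get(name, match.group(0)))

    return _VARIABLE.sub(substitute, template)


prompt = (
    'Rate the answer. Reply with JSON like {"score": 1}.\n'
    "Question: {{question}}\n"
    "Answer: {{answer}}"
)
print(render(prompt, {"question": "What is 2 + 2?", "answer": "4"}))
```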

Multiple Metrics in Human Evaluation

September 9, 2025

We spent the past months rethinking how evaluation should work. Today we're announcing one of the first big improvements.

The fastest teams building LLM apps were using human evaluation to check their outputs before going live. Agenta was helping them do this in minutes.

But we also saw that they were limited. You could only score the outputs with one metric.

That's why we rebuilt the human evaluation workflow.

Now you can set multiple evaluators and metrics and use them to score the outputs. This lets you evaluate the same output on different metrics, such as relevance or completeness. You can also create binary or numerical scores, or use strings for comments or expected answers.

This unlocks a whole new set of use cases:

Compare your prompts on multiple metrics and understand where you can improve.
Turn your annotations into test sets and use them in prompt engineering. For instance, you can add comments that help you improve your prompts later.
Use human evaluation to bootstrap automatic evaluation. You can annotate your outputs with the expected answer or a rubric, then use it to set up an automatic evaluation.

Watch the video below and read the post for more details. Or check out the docs to learn how to use the new human evaluation workflow.

Major Playground Improvements and Enhancements

August 7, 2025

We've made lots of improvements to the playground. Here are some of the highlights:

JSON Editor Improvements

Enhanced Error Display and Editing

The JSON editor now provides clearer error messages and improved editing functionality. We've fixed issues with error display that previously made it difficult to debug JSON configuration problems.

Undo Support with Ctrl+Z

You can now use Ctrl+Z (or Cmd+Z on Mac) to undo changes in the JSON editor, making it much easier to iterate on complex JSON configurations without fear of losing your work.

Bug Fix: JSON Field Order Preservation

The structured output JSON field order is now preserved throughout the system. This is crucial when working with LLMs that are sensitive to the ordering of JSON fields in their responses.

Previously, JSON objects might have their field order changed during processing, which could affect LLM behavior and evaluation consistency. Now, the exact order you define is maintained across all operations.
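
As a general illustration of how field order can be lost or kept (not a description of what Agenta changed internally), the sketch below uses Python's standard `json` module. Parsing keeps keys in document order, while serializing with `sort_keys=True` silently reorders them. The schema content is a made-up example.

```python
import json

# The author wants the model to write its reasoning before the verdict.
schema_text = '{"reasoning": {"type": "string"}, "passed": {"type": "boolean"}}'

parsed = json.loads(schema_text)
print(list(parsed))  # ['reasoning', 'passed']

# A plain round trip keeps the original order.
preserved = json.loads(json.dumps(parsed))
print(list(preserved))  # ['reasoning', 'passed']

# Sorting keys during serialization changes the order the model would see.
reordered = json.loads(json.dumps(parsed, sort_keys=True))
print(list(reordered))  # ['passed', 'reasoning']
```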

Playground Improvements

Dynamic variables

We've improved the editor behavior with dynamic variables in the prompt.

Markdown and Text View Toggle

You can now switch between markdown and text view for messages.

Collapsible Interface Elements

We've added the ability to collapse various sections of the playground interface, helping you focus on what matters most for your current task.

Collapsible Test Cases for Large Sets

When loading large test sets, you can now collapse individual test cases to better manage the interface.

Visual diff when committing changes

The playground now shows a visual diff when you're committing changes, making it easy to review exactly what modifications you're about to save.
