Launch Week #2 Day 3: Evaluation SDK

Nov 12, 2025

Most LLM apps today are complex agents with multiple steps, tool calls, and retrieval flows.

It's hard to evaluate this complexity from the UI alone. You need programmatic control over what you test and how you measure it.

That's why we built the Evaluation SDK.

Evaluate Complex Agents Programmatically

The Evaluation SDK gives you full control over evaluating your AI agents.

Create or fetch test sets. Define test sets directly in code or fetch existing ones from Agenta. Test sets may or may not include ground truth data, depending on which evaluators you use.

Write custom evaluators. Build evaluators in code that match your exact requirements. Check for specific patterns, validate outputs against your rules, or measure domain-specific quality metrics.

Use built-in evaluators. We provide a library of evaluators including LLM-as-a-judge, semantic similarity, regex matching, and code-based validators. Save evaluator configurations in Agenta to reuse them across runs.

Evaluate end to end or specific steps. Test your entire agent flow or measure specific spans. You can evaluate retrieval quality, tool call accuracy, or any individual step in your agent's execution.

View results in the dashboard. Run evaluations from code and get a link to view results in Agenta. You can see the full traces, compare runs, and share results with your team.

Works With Any Framework

The SDK works with any agent framework. Whether you use the OpenAI Agents SDK, LangGraph, LlamaIndex, or a custom implementation, the workflow is the same.

You instrument your agent once. Then you can evaluate it however you need.
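
To make that concrete, here is a minimal instrumentation sketch in Python. It assumes the ag.init() call and @ag.instrument() decorator from Agenta's observability SDK (adapt the names if your SDK version exposes them differently); the retrieval and generation bodies are placeholders for your own code.

```python
# A minimal sketch: wrap the agent in one entry-point function and split each
# step into its own function so spans can be traced and evaluated separately.
# ag.init() and @ag.instrument() are assumed from Agenta's observability SDK;
# the step bodies are placeholders for your own retrieval and LLM calls.
import agenta as ag

ag.init()  # reads the Agenta API key and host from the environment

@ag.instrument()
def retrieve(query: str) -> list[str]:
    # ... query your vector store here ...
    return ["chunk about refunds", "chunk about shipping"]

@ag.instrument()
def generate(query: str, context: list[str]) -> str:
    # ... call your LLM here ...
    return f"Answer to '{query}' based on {len(context)} retrieved chunks."

@ag.instrument()  # root span covering the whole agent run
def agent(query: str) -> str:
    context = retrieve(query)
    return generate(query, context)
```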

How It Works

Here's how you use the SDK:

First, you create or fetch your test set. You can define test cases in code or pull an existing test set from Agenta.
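
A test set defined in code can be as simple as a list of dictionaries. The field names below are illustrative, not a schema the SDK requires:

```python
# A test set defined directly in code: each case pairs an input with optional
# ground truth. Evaluators that need no reference simply ignore "expected".
test_set = [
    {"query": "What is the refund window?", "expected": "30 days"},
    {"query": "Do you ship internationally?", "expected": "Yes, to most countries"},
    {"query": "Summarize the warranty terms.", "expected": None},  # no ground truth
]
```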

Then you set up evaluators. Write custom evaluators in Python or configure built-in ones like LLM-as-a-judge or semantic similarity.
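
A custom evaluator is just a Python function that takes the agent's output (and, optionally, the expected value) and returns a score. The exact signature your SDK version expects may differ; these two sketches show the idea:

```python
import re

# Reference-free evaluator: pass if the response cites an order id like "ORD-12345".
def contains_order_id(output: str, expected: str | None = None) -> float:
    return 1.0 if re.search(r"\bORD-\d{5}\b", output) else 0.0

# Ground-truth evaluator: strict comparison, useful for short factual answers.
def exact_match(output: str, expected: str | None = None) -> float:
    if expected is None:
        return 0.0
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0
```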

When you need finer granularity, you can evaluate specific spans in your agent's execution. For example, check the quality of the retrieval step before the agent generates its final response.
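
A span-level check might look like the sketch below: given the documents returned by the retrieval step, score how many of them mention the query terms. How span data reaches the evaluator depends on the SDK; here it is passed in as plain arguments.

```python
# Span-level sketch: score the retrieval step by the fraction of returned
# documents that mention at least one substantive query term.
def retrieval_hit_rate(query: str, retrieved_docs: list[str]) -> float:
    terms = [t.lower() for t in query.split() if len(t) > 3]
    if not terms or not retrieved_docs:
        return 0.0
    hits = sum(
        1 for doc in retrieved_docs
        if any(term in doc.lower() for term in terms)
    )
    return hits / len(retrieved_docs)
```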

Finally, run the evaluation. The SDK executes your agent on each test case, applies the evaluators, and generates results.
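
Conceptually, a run loops over the test set, executes the agent, and applies each evaluator. The SDK wraps this loop, plus tracing and the dashboard upload, behind its own API; the hand-rolled version below, reusing the earlier sketches, only illustrates the mechanics.

```python
# What a run does conceptually: execute the agent on every test case, apply
# each evaluator to the output, and collect the scores.
def run_evaluation(agent_fn, cases, evaluators):
    results = []
    for case in cases:
        output = agent_fn(case["query"])
        scores = {
            name: evaluator(output, case.get("expected"))
            for name, evaluator in evaluators.items()
        }
        results.append({"input": case["query"], "output": output, "scores": scores})
    return results

results = run_evaluation(
    agent,  # instrumented agent from the earlier sketch
    test_set,
    {"exact_match": exact_match, "contains_order_id": contains_order_id},
)
```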

You get a link to the Agenta dashboard. There you can see the overview, inspect individual traces, and compare different runs.

Get Started

The Evaluation SDK is available now.

Read the documentation

Try it in Agenta Cloud

This is day 3 of our launch week. Tomorrow we're announcing the most exciting feature yet.

Ship reliable agents faster with Agenta

Build reliable LLM apps together with integrated prompt management, evaluation, and observability.
