Launch Week #2 Day 3: Evaluation SDK
Nov 12, 2025



Most LLM apps today are complex agents with multiple steps, tool calls, and retrieval flows.
It's hard to evaluate this complexity from the UI alone. You need programmatic control over what you test and how you measure it.
That's why we built the Evaluation SDK.
Evaluate Complex Agents Programmatically
The Evaluation SDK gives you full control over evaluating your AI agents.
Create or fetch test sets. You can create test sets directly in code or fetch existing test sets from Agenta. Test sets can include ground-truth data or omit it, depending on what your evaluators need.
Write custom evaluators. Build evaluators in code that match your exact requirements. Check for specific patterns, validate outputs against your rules, or measure domain-specific quality metrics.
Use built-in evaluators. We provide a library of evaluators including LLM-as-a-judge, semantic similarity, regex matching, and code-based validators, and you can save evaluator configurations in Agenta to reuse them across runs. A sketch of what an LLM-as-a-judge check does under the hood follows below.
Evaluate end to end or specific steps. Test your entire agent flow or measure specific spans. You can evaluate retrieval quality, tool call accuracy, or any individual step in your agent's execution.
View results in the dashboard. Run evaluations from code and get a link to view results in Agenta. You can see the full traces, compare runs, and share results with your team.
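
To make the LLM-as-a-judge idea concrete, here is what such a check does under the hood, written as a plain code-based evaluator. This is a generic sketch using the OpenAI client, not Agenta's built-in implementation:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Generic LLM-as-a-judge: ask a model to grade the agent's output.
# This illustrates the idea only; Agenta's built-in judge evaluator
# is configured in the dashboard or SDK rather than written by hand.
def llm_judge(test_case: dict, output: str) -> float:
    prompt = (
        "Rate from 0 to 10 how well the answer addresses the question. "
        "Reply with a single number.\n"
        f"Question: {test_case['question']}\n"
        f"Answer: {output}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return float(response.choices[0].message.content.strip()) / 10
    except ValueError:
        return 0.0  # the judge replied with something that is not a number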
Works With Any Framework
The SDK works with any agent framework. OpenAI Agents SDK, LangGraph, LlamaIndex, or custom implementations all work the same way.
You instrument your agent once. Then you can evaluate it however you need.
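
As a minimal sketch of that one-time instrumentation (the setup call and decorator follow Agenta's observability SDK; the retrieval and generation helpers are placeholders for your own code):

import agenta as ag

ag.init()  # reads the Agenta API key and host from the environment, or pass them explicitly

# Instrument the agent entry point once. Every call then produces a trace
# with nested spans for retrieval, tool calls, and generation.
@ag.instrument()
def answer_question(question: str) -> str:
    docs = retrieve_documents(question)      # placeholder: your retrieval step
    return generate_answer(question, docs)   # placeholder: your generation step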
How It Works
Here's how you use the SDK:
First, you create or fetch your test set. You can define test cases in code or pull an existing test set from Agenta.
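
For example, a test set defined in code can be a plain list of records; the commented-out fetch call is hypothetical and only illustrates the pull-from-Agenta path:

# Define test cases inline. The field names are up to you and just need
# to match what your agent input and evaluators expect.
test_set = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Do you ship to Canada?", "expected": "yes"},
]

# Or fetch an existing test set stored in Agenta
# (hypothetical helper shown for illustration):
# test_set = ag.testsets.get(name="support-faq")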
Then you set up evaluators. Write custom evaluators in Python or configure built-in ones like LLM-as-a-judge or semantic similarity.
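
A custom evaluator is just a function that scores one output against one test case. The exact signature the SDK expects is an assumption here; the shape is what matters:

# Custom evaluator: checks whether the ground-truth answer appears
# in the agent's output. Returns 1.0 on a hit and 0.0 on a miss.
def contains_expected_answer(test_case: dict, output: str) -> float:
    expected = test_case["expected"].lower()
    return 1.0 if expected in output.lower() else 0.0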
Next, decide what to evaluate. You can target the full flow or specific spans in your agent's execution. For example, check the quality of a retrieval step before the agent generates its final response.
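
A span-level check follows the same pattern but scores an intermediate step instead of the final answer. How the SDK hands span data to your function is assumed here for illustration:

# Span-level evaluator: what fraction of retrieved documents mention
# the expected answer. The `span` dict carrying the retrieved documents
# is an assumed shape for this sketch.
def retrieval_hit_rate(test_case: dict, span: dict) -> float:
    docs = span.get("retrieved_documents", [])
    if not docs:
        return 0.0
    expected = test_case["expected"].lower()
    hits = sum(1 for doc in docs if expected in doc.lower())
    return hits / len(docs)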
Finally, run the evaluation. The SDK executes your agent on each test case, applies the evaluators, and generates results.
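
Putting the pieces together, a run might look like the sketch below. The `evaluate` entry point, its parameters, and the result object are illustrative; check the SDK reference for the exact call:

# Hypothetical top-level call: run the instrumented agent over the test set,
# apply the evaluators, and push results to Agenta.
results = ag.evaluate(
    app=answer_question,
    test_set=test_set,
    evaluators=[contains_expected_answer, retrieval_hit_rate, llm_judge],
)
print(results.url)  # link to the run in the Agenta dashboard (illustrative attribute)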
You get a link to the Agenta dashboard. There you can see the overview, inspect individual traces, and compare different runs.
Get Started
The Evaluation SDK is available now.
This is day 3 of our launch week. Tomorrow we're announcing the most exciting feature yet.