Add Spans to Test Sets


This release introduces the ability to add spans to test sets, making it easier to bootstrap your evaluation data from production. The new feature lets you:

  • Add individual or batch spans to test sets
  • Create custom mappings between spans and test sets
  • Preview test set changes before committing them

Additional improvements:

  • Fixed CSV test set upload issues
  • Prevented viewing of incomplete evaluations
  • Added mobile compatibility warning
  • Added support for custom ports in self-hosted installations

Viewing Traces in the Playground and Authentication for Deployed Applications

Viewing traces in the playground:

You can now see traces directly in the playground. For simple applications, this means you can view the prompts sent to LLMs. For custom workflows, you get an overview of intermediate steps and outputs. This makes it easier to understand what’s happening under the hood and debug your applications.

Authentication improvements:

We’ve strengthened authentication for deployed applications. As you know, Agenta lets you either fetch the app’s config or call the app with Agenta acting as a proxy. We’ve now added authentication to the second method: the APIs we create are protected and can be called using an API key. You can find code snippets for calling the application on the overview page.
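
For illustration, calling a deployed application might look roughly like the sketch below. The endpoint URL, auth header format, and payload shape here are assumptions; copy the exact snippet from your application's overview page.

import requests

# Hypothetical example of calling a deployed Agenta application through the proxy.
# The endpoint URL, auth header scheme, and payload fields below are assumptions;
# use the exact snippet shown on your application's overview page.
API_KEY = "your-agenta-api-key"

response = requests.post(
    "https://cloud.agenta.ai/services/completion/generate",  # assumed endpoint
    headers={"Authorization": f"ApiKey {API_KEY}"},           # assumed header scheme
    json={"country": "France"},                               # assumed input schema
)
response.raise_for_status()
print(response.json())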

Documentation improvements:

We’ve added new cookbooks and updated existing documentation.

Bug fixes:

  • Fixed an issue with the observability SDK not being compatible with LiteLLM.
  • Fixed an issue where cost and token usage were not correctly computed for all calls.

Observability and Prompt Management

This release is one of our biggest yet—one changelog hardly does it justice.

First up: Observability

We’ve had observability in beta for a while, but now it’s been completely rewritten, with a brand-new UI and fully open-source code.

The new Observability SDK is compatible with OpenTelemetry (OTel) and the GenAI semantic conventions. This means you get many integrations right out of the box, like LangChain, OpenAI, and more.

We’ll publish a full blog post soon, but here’s a quick look at what the new observability offers:

  • A redesigned UI that lets you visualize nested traces, making it easier to understand what’s happening behind the scenes.

  • The web UI lets you filter traces by name, cost, and other attributes—you can even search through them easily.

  • The SDK is OTel-compatible, and we’ve already tested integrations for OpenAI, LangChain, LiteLLM, and Instructor, with guides available for each. In most cases, adding a few lines of code will have you seeing traces directly in Agenta (see the sketch below).
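
As a minimal sketch, instrumenting an OpenAI-based app could look roughly like this (it assumes the opentelemetry-instrumentation-openai package and environment-variable configuration; follow the integration guides for the exact setup):

import agenta as ag
from openai import OpenAI
from opentelemetry.instrumentation.openai import OpenAIInstrumentor  # assumed instrumentation package

ag.init()                          # connect to Agenta (configuration via environment variables is assumed)
OpenAIInstrumentor().instrument()  # auto-traces every OpenAI call from here on

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)  # this call shows up as a trace in Agenta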

Next: Prompt Management

We’ve completely rewritten the prompt management SDK, giving you full CRUD capabilities for prompts and configurations. This includes creating, updating, reading history, deploying new versions, and deleting old ones. You can find a first tutorial for this here.
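
As a rough sketch of that CRUD flow (the manager and method names below are assumptions based on the capabilities described above, not the confirmed API; follow the tutorial for the exact calls):

import agenta as ag

ag.init()

# NOTE: the manager/method names below are illustrative assumptions.
parameters = {"prompt_template": "What is the capital of {country}?", "temperature": 0.2}

ag.VariantManager.create(             # create a prompt variant
    app_slug="capital-finder", variant_slug="baseline", parameters=parameters
)
ag.VariantManager.commit(             # update it by committing a new version
    app_slug="capital-finder", variant_slug="baseline",
    parameters={**parameters, "temperature": 0.7},
)
history = ag.VariantManager.history(  # read the version history
    app_slug="capital-finder", variant_slug="baseline"
)
ag.DeploymentManager.deploy(          # deploy a version to an environment
    app_slug="capital-finder", variant_slug="baseline", environment_slug="production"
)
ag.VariantManager.delete(             # delete an old variant
    app_slug="capital-finder", variant_slug="baseline"
)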

And finally: LLM-as-a-Judge Overhaul

We've made significant upgrades to the LLM-as-a-Judge evaluator. It now supports prompts with multiple messages and has access to all variables in a test case. You can also switch models (currently supporting OpenAI and Anthropic). These changes make the evaluator much more flexible, and we're seeing better results with it.

Configuring the LLM-as-a-Judge evaluator
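
For illustration, a multi-message judge prompt that references test-case variables might look something like this (the placeholder syntax and variable names are assumptions, not the evaluator's exact format):

# Illustrative only: a multi-message judge prompt with test-case variables.
# The {placeholder} syntax and variable names are assumptions.
judge_messages = [
    {
        "role": "system",
        "content": "You are an impartial grader. Return a score between 0 and 1.",
    },
    {
        "role": "user",
        "content": (
            "Question: {question}\n"
            "Reference answer: {correct_answer}\n"
            "Model answer: {prediction}\n"
            "Score the model answer and return only the number."
        ),
    },
]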

New Application Management View and Various Improvements

We updated the Application Management view to improve the UI. Many users struggled to find their applications when they had a large number of them, so we've improved the view and added a search bar for quick filtering. Additionally, we are moving towards a new project structure: test sets and evaluators now live outside the application scope, so you can reuse the same test sets and evaluators across multiple applications.

Bug Fixes

  • Added an export button in the evaluation view to export results from the main view.
  • Eliminated Pydantic warnings in the CLI.
  • Improved error messages when fetch_config is called with wrong arguments.
  • Enhanced the custom code evaluation sandbox and removed the limitation that results need to be between 0 and 1.

Evaluator Testing Playground and a New Evaluation View


Many users faced challenges configuring evaluators in the web UI. Some evaluators, such as LLM-as-a-judge, custom code, or RAG evaluators, can be tricky to set up correctly on the first try. Until now, users needed to set up the evaluator, run an evaluation, check the errors, and then start over.

To address this, we've introduced a new evaluator test/debug playground. This feature allows you to test the evaluator live on real data, helping you test the configuration before committing to it and using it for evaluations.

Additionally, we have improved and redesigned the evaluation view. Both automatic and human evaluations are now within the same view but in different tabs. We're moving towards unifying all evaluator results and consolidating them in one view, allowing you to quickly get an overview of what's working.


UI Redesign, Configuration Management, and Overview View

We've completely redesigned the platform's UI. Additionally, we have introduced a new overview view for your applications. This is part of a series of upcoming improvements slated for the next few weeks.

The new overview view offers:

  • A dashboard displaying key metrics of your application
  • A table with all the variants of your applications
  • A summary of your application's most recent evaluations

We've also added a new JSON Diff evaluator. This evaluator compares two JSON objects and provides a similarity score.
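
As a rough illustration of the idea (not necessarily Agenta's exact algorithm), a similarity score between two JSON objects can be computed by recursively comparing keys and values:

# Illustrative sketch of a JSON similarity score; not Agenta's exact implementation.
def json_similarity(expected, actual) -> float:
    if isinstance(expected, dict) and isinstance(actual, dict):
        keys = set(expected) | set(actual)
        if not keys:
            return 1.0
        return sum(json_similarity(expected.get(k), actual.get(k)) for k in keys) / len(keys)
    if isinstance(expected, list) and isinstance(actual, list):
        if not expected and not actual:
            return 1.0
        longer = max(len(expected), len(actual))
        return sum(json_similarity(e, a) for e, a in zip(expected, actual)) / longer
    return 1.0 if expected == actual else 0.0

# Matching "a", differing nested "c" -> similarity of 0.5
print(json_similarity({"a": 1, "b": {"c": 2}}, {"a": 1, "b": {"c": 3}}))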

Lastly, we've updated the UI of our documentation.


New Alpha Version of the SDK for Creating Custom Applications

We've released a new version of the SDK for creating custom applications. This Pydantic-based SDK significantly simplifies the process of building custom applications. It's fully backward compatible, so your existing code will continue to work seamlessly. We'll soon be rolling out comprehensive documentation and examples for the new SDK.

In the meantime, here's a quick example of how to use it:

import agenta as ag
from openai import OpenAI
from pydantic import BaseModel, Field

ag.init()
client = OpenAI()

# Define the configuration of the application (this will be shown in the playground)
class MyConfig(BaseModel):
    temperature: float = Field(default=0.2)
    prompt_template: str = Field(default="What is the capital of {country}?")

# Creates an endpoint for the entrypoint of the application
@ag.route("/", config_schema=MyConfig)
def generate(country: str) -> str:
    # Fetch the config from the request
    config: MyConfig = ag.ConfigManager.get_from_route(schema=MyConfig)
    prompt = config.prompt_template.format(country=country)
    chat_completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=config.temperature,
    )
    return chat_completion.choices[0].message.content


RAGAS Evaluators and Traces in the Playground

We're excited to announce two major features this week:

  1. We've integrated RAGAS evaluators into agenta. Two new evaluators have been added: RAG Faithfulness (measuring how consistent the LLM output is with the context) and Context Relevancy (assessing how relevant the retrieved context is to the question). Both evaluators use intermediate outputs within the trace to calculate the final score.

    Check out the tutorial to learn how to use RAG evaluators.

  2. You can now view traces directly in the playground. This feature enables you to debug your application while configuring it—for example, by examining the prompts sent to the LLM or reviewing intermediate outputs.

Note: Both features are available exclusively in the cloud and enterprise versions of agenta.


Migration from MongoDB to Postgres

We have migrated the Agenta database from MongoDB to Postgres. As a result, the platform is much faster (up to 10x in some use cases).

However, if you are self-hosting agenta, note that this is a breaking change that requires you to manually migrate your data from MongoDB to Postgres.

If you are using the cloud version of Agenta, there is nothing you need to do (other than enjoying the new performance improvements).


More Reliable Evaluations

We have worked extensively on improving the reliability of evaluations. Specifically:

  • We improved status reporting for evaluations and added a new Queued status.
  • We improved error handling in evaluations: we now show the exact error message that caused an evaluation to fail.
  • We fixed issues that caused evaluations to run indefinitely.
  • We fixed issues in the calculation of scores in human evaluations.
  • We fixed small UI issues with large outputs in human evaluations.
  • We added a new export button in the evaluation view to export the results as a CSV file.

In observability:

  • We have added a new integration with LiteLLM to automatically trace all LLM calls made through it.
  • We now automatically propagate cost and token usage from spans to traces.