Quality of life improvements

New collapsible side menu

A small release today with quality-of-life improvements while we prepare the big release coming in the next few days:

  • Added a collapsible side menu for better space management
  • Enhanced frontend performance and responsiveness
  • Implemented a confirmation modal when deleting test sets
  • Improved permission handling across the platform
  • Improved frontend test coverage

Agenta is SOC 2 Type 1 Certified

We've achieved SOC 2 Type 1 certification, validating our security controls for protecting sensitive LLM development data. This certification covers our entire platform, including prompt management, evaluation frameworks, and observability tools.

Key security features and improvements:

  • Data encryption in transit and at rest
  • Enhanced access control and authentication
  • Comprehensive security monitoring
  • Regular third-party security assessments
  • Backup and disaster recovery protocols

This certification represents a significant milestone for teams using Agenta in production environments. Whether you're using our open-source platform or cloud offering, you can now build LLM applications with enterprise-grade security confidence.

We've also updated our trust center with detailed information about our security practices and compliance standards. For teams interested in learning more about our security controls or requesting our SOC 2 report, please contact [email protected].


New Onboarding Flow

We've redesigned our platform's onboarding to make getting started simpler and more intuitive. Key improvements include:

  • Streamlined tracing setup process
  • Added a demo RAG playground project showcasing custom workflows
  • Enhanced frontend performance
  • Fixed scroll behavior in trace view

Add Spans to Test Sets


This release introduces the ability to add spans to test sets, making it easier to bootstrap your evaluation data from production. The new feature lets you:

  • Add individual or batch spans to test sets
  • Create custom mappings between spans and test sets
  • Preview test set changes before committing them

Additional improvements:

  • Fixed CSV test set upload issues
  • Prevented viewing of incomplete evaluations
  • Added mobile compatibility warning
  • Added support for custom ports in self-hosted installations

Viewing Traces in the Playground and Authentication for Deployed Applications

Viewing traces in the playground:

You can now see traces directly in the playground. For simple applications, this means you can view the prompts sent to LLMs. For custom workflows, you get an overview of intermediate steps and outputs. This makes it easier to understand what’s happening under the hood and debug your applications.

Authentication improvements:

We’ve strengthened authentication for deployed applications. As you know, Agenta lets you either fetch the app’s config or call the app with Agenta acting as a proxy. We’ve now added authentication to the second method: the APIs we create are protected and can be called with an API key. You can find code snippets for calling your application on the overview page.
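
As a minimal sketch of what such a call looks like: the URL below and the exact Authorization header format are assumptions for illustration only; copy the real values from the code snippets on your app's overview page.

import requests

# Assumptions: both the URL and the header format come from your app's
# overview page; the values below are placeholders.
url = "https://cloud.agenta.ai/<path-to-your-app>/generate"
headers = {
    "Authorization": "<your-agenta-api-key>",
    "Content-Type": "application/json",
}
payload = {"country": "France"}  # whatever inputs your application expects

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()
print(response.json())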

Documentation improvements:

We’ve added new cookbooks and updated existing documentation.

Bug fixes:

  • Fixed an issue with the observability SDK not being compatible with LiteLLM.
  • Fixed an issue where cost and token usage were not correctly computed for all calls.

Observability and Prompt Management

This release is one of our biggest yet—one changelog hardly does it justice.

First up: Observability

We’ve had observability in beta for a while, but now it’s been completely rewritten, with a brand-new UI and fully open-source code.

The new Observability SDK is compatible with OpenTelemetry (Otel) and gen-ai semantic conventions. This means you get a lot of integrations right out of the box, like LangChain, OpenAI, and more.

We’ll publish a full blog post soon, but here’s a quick look at what the new observability offers:

  • A redesigned UI that lets you visualize nested traces, making it easier to understand what’s happening behind the scenes.

  • The web UI lets you filter traces by name, cost, and other attributes—you can even search through them easily.

  • The SDK is Otel-compatible, and we’ve already tested integrations for OpenAI, LangChain, LiteLLM, and Instructor, with guides available for each. In most cases, adding a few lines of code will have you seeing traces directly in Agenta.
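
As a rough sketch of what "a few lines of code" means, here is what instrumenting a plain OpenAI call could look like, assuming the @ag.instrument() decorator from the observability SDK; the exact setup for each integration (including any extra instrumentor packages) is described in its guide.

import agenta as ag
from openai import OpenAI

ag.init()  # expects your Agenta credentials/host in the environment

client = OpenAI()

# Assumption: @ag.instrument() wraps the function in a span so the call
# shows up as a trace in Agenta; see the integration guides for details.
@ag.instrument()
def answer(question: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    return completion.choices[0].message.content

print(answer("What is the capital of France?"))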

Next: Prompt Management

We’ve completely rewritten the prompt management SDK, giving you full CRUD capabilities for prompts and configurations. This includes creating, updating, reading history, deploying new versions, and deleting old ones. You can find a first tutorial for this here.
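
To give a feel for the shape of the new SDK, here is an illustrative sketch of that CRUD flow. The manager and method names below (VariantManager.create, VariantManager.commit, DeploymentManager.deploy, and so on) are assumptions made for illustration; please refer to the tutorial for the actual API.

import agenta as ag

ag.init()

# NOTE: the names and parameters below are illustrative assumptions.

# Create a new variant with an initial configuration
ag.VariantManager.create(
    app_slug="my-app",
    variant_slug="my-variant",
    parameters={"prompt": "What is the capital of {country}?", "temperature": 0.2},
)

# Update it by committing a new version of the configuration
ag.VariantManager.commit(
    app_slug="my-app",
    variant_slug="my-variant",
    parameters={"prompt": "Name the capital city of {country}.", "temperature": 0.1},
)

# Read back the configuration history for the variant
history = ag.VariantManager.history(app_slug="my-app", variant_slug="my-variant")

# Deploy the latest version to an environment
ag.DeploymentManager.deploy(
    app_slug="my-app",
    variant_slug="my-variant",
    environment_slug="production",
)

# Delete an old variant
ag.VariantManager.delete(app_slug="my-app", variant_slug="my-variant")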

And finally: LLM-as-a-Judge Overhaul

We've made significant upgrades to the LLM-as-a-Judge evaluator. It now supports prompts with multiple messages and has access to all variables in a test case. You can also switch models (currently supporting OpenAI and Anthropic). These changes make the evaluator much more flexible, and we're seeing better results with it.
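
For illustration only, a judge prompt with several messages that pulls in test-case variables might be shaped like this; the {{...}} placeholder syntax and the column names are assumptions, and the real configuration is done in the evaluator settings UI.

# Illustrative only: message roles and {{...}} placeholders are assumptions.
judge_prompt = [
    {
        "role": "system",
        "content": "You are an impartial grader. Score the answer from 0 to 1.",
    },
    {
        "role": "user",
        "content": (
            "Question: {{question}}\n"
            "Reference answer: {{correct_answer}}\n"
            "Model answer: {{prediction}}\n"
            "Return only the score."
        ),
    },
]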

Configuring the LLM-as-a-Judge evaluator

New Application Management View and Various Improvements

We've updated the Application Management view. Many users struggled to find their applications once they had a large number of them, so we've improved the layout and added a search bar for quick filtering. We're also moving towards a new project structure: test sets and evaluators now live outside the application scope, so you can use the same test set and evaluators across multiple applications.

Bug Fixes

  • Added an export button in the evaluation view to export results from the main view.
  • Eliminated Pydantic warnings in the CLI.
  • Improved error messages when fetch_config is called with wrong arguments.
  • Enhanced the custom code evaluation sandbox and removed the limitation that results need to be between 0 and 1.

Evaluator Testing Playground and a New Evaluation View


Many users faced challenges configuring evaluators in the web UI. Some evaluators, such as LLM-as-a-judge, custom code, or RAG evaluators, can be tricky to set up correctly on the first try. Until now, users had to configure the evaluator, run an evaluation, check the errors, and then start over.

To address this, we've introduced a new evaluator test/debug playground. It lets you run the evaluator live on real data, so you can validate the configuration before committing to it and using it in evaluations.

Additionally, we have improved and redesigned the evaluation view. Both automatic and human evaluations are now within the same view but in different tabs. We're moving towards unifying all evaluator results and consolidating them in one view, allowing you to quickly get an overview of what's working.


UI Redesign, Configuration Management, and Overview View

We've completely redesigned the platform's UI. Additionally, we've introduced a new overview view for your applications. This is part of a series of improvements slated for the next few weeks.

The new overview view offers:

  • A dashboard displaying key metrics of your application
  • A table with all the variants of your applications
  • A summary of your application's most recent evaluations

We've also added a new JSON Diff evaluator. This evaluator compares two JSON objects and provides a similarity score.
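
As a rough illustration of the idea (not Agenta's actual implementation), a JSON diff similarity score can be thought of as the fraction of expected leaf values that match the other object:

def json_similarity(expected, actual) -> float:
    """Fraction of leaves in `expected` that match `actual` (illustrative only)."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        if not expected:
            return 1.0
        return sum(
            json_similarity(v, actual.get(k)) for k, v in expected.items()
        ) / len(expected)
    if isinstance(expected, list) and isinstance(actual, list):
        if not expected:
            return 1.0
        return sum(json_similarity(e, a) for e, a in zip(expected, actual)) / len(expected)
    return 1.0 if expected == actual else 0.0

# Example: one of two fields matches, so the score is 0.5
print(json_similarity({"city": "Paris", "country": "France"},
                      {"city": "Paris", "country": "Italy"}))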

Lastly, we've updated the UI of our documentation.


New Alpha Version of the SDK for Creating Custom Applications

We've released a new version of the SDK for creating custom applications. This Pydantic-based SDK significantly simplifies the process of building custom applications. It's fully backward compatible, so your existing code will continue to work seamlessly. We'll soon be rolling out comprehensive documentation and examples for the new SDK.

In the meantime, here's a quick example of how to use it:

import agenta as ag
from openai import OpenAI
from pydantic import BaseModel, Field

ag.init()
client = OpenAI()

# Define the configuration of the application (shown in the playground)
class MyConfig(BaseModel):
    temperature: float = Field(default=0.2)
    prompt_template: str = Field(default="What is the capital of {country}?")

# Creates an endpoint for the entrypoint of the application
@ag.route("/", config_schema=MyConfig)
def generate(country: str) -> str:
    # Fetch the config from the request
    config: MyConfig = ag.ConfigManager.get_from_route(schema=MyConfig)
    prompt = config.prompt_template.format(country=country)
    chat_completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=config.temperature,
    )
    return chat_completion.choices[0].message.content