RAGAS Evaluators and Traces in the Playground

We're excited to announce two major features this week:

  1. We've integrated RAGAS evaluators into agenta. Two new evaluators have been added: RAG Faithfulness (measuring how consistent the LLM output is with the context) and Context Relevancy (assessing how relevant the retrieved context is to the question). Both evaluators use intermediate outputs within the trace to calculate the final score.

    Check out the tutorial to learn how to use RAG evaluators (an illustrative sketch also follows the note below).

  2. You can now view traces directly in the playground. This feature enables you to debug your application while configuring it, for example by examining the prompts sent to the LLM or reviewing intermediate outputs.

note

Both features are available exclusively in the cloud and enterprise versions of agenta.
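
Under the hood, these evaluators are based on the RAGAS metrics of the same names, which use an LLM as a judge over the question, the retrieved contexts, and the generated answer captured in the trace. The sketch below shows how the same scores could be computed with the ragas package directly. It is illustrative only: it assumes a ragas 0.1-era API (imports and metric names vary across versions, and newer releases rename context_relevancy), and it requires an OPENAI_API_KEY for the judge model.

# Illustrative only: ragas 0.1-era API; requires `pip install ragas datasets`
# and an OPENAI_API_KEY for the judge LLM.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_relevancy

samples = Dataset.from_dict({
    "question": ["What is agenta?"],
    "contexts": [["agenta is an open-source LLMOps platform."]],
    "answer": ["agenta is an open-source platform for building LLM apps."],
})

# Faithfulness: is the answer consistent with the retrieved contexts?
# Context relevancy: are the retrieved contexts relevant to the question?
scores = evaluate(samples, metrics=[faithfulness, context_relevancy])
print(scores)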


Migration from MongoDB to Postgres

We have migrated the Agenta database from MongoDB to Postgres. As a result, the platform is much faster (up to 10x in some use cases).

However, if you are self-hosting agenta, note that this is a breaking change that requires you to manually migrate your data from MongoDB to Postgres.

If you are using the cloud version of Agenta, there is nothing you need to do (other than enjoying the new performance improvements).


More Reliable Evaluations

We have worked extensively on improving the reliability of evaluations. Specifically:

  • We improved the statuses for evaluations and added a new Queued status.
  • We improved the error handling in evaluations. Now we show the exact error message that caused the evaluation to fail.
  • We fixed issues that caused evaluations to run indefinitely.
  • We fixed issues in the calculation of scores in human evaluations.
  • We fixed small UI issues with large outputs in human evaluations.
  • We have added a new export button in the evaluation view to export the results as a CSV file.

In observability:

  • We have added a new integration with LiteLLM to automatically trace all LLM calls made through it.
  • We now automatically propagate cost and token usage from spans to traces (see the sketch below).
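
To illustrate the kind of per-call data involved (cost and token counts), the sketch below uses LiteLLM's public custom-callback hook to log cost and usage after each call. It is not agenta's integration, just a minimal example of reading this data from a LiteLLM response; track_cost_and_tokens is a hypothetical name, and with the agenta integration enabled this bookkeeping happens automatically.

import litellm
from litellm import completion

def track_cost_and_tokens(kwargs, completion_response, start_time, end_time):
    # LiteLLM computes the call's cost from its model pricing table
    cost = litellm.completion_cost(completion_response=completion_response)
    usage = completion_response.usage
    print(f"cost=${cost:.6f}, total_tokens={usage.total_tokens}")

# Register the callback so it runs after every successful completion
litellm.success_callback = [track_cost_and_tokens]

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)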

Evaluators Can Access All Columns

Evaluators can now access all columns in the test set. Previously, you were limited to using the correct_answer column as the ground truth / reference answer in evaluations. Now you can configure your evaluator to use any column in the test set as the ground truth. To do that, open the collapsible Advanced Settings when configuring the evaluator and set the Expected Answer Column to the name of the column containing the reference answer you want to use (see the sketch below).
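
For example, you could build and upload a test set like the one sketched below (the file and column names here are arbitrary examples) and then point the Expected Answer Column at reference_answer instead of correct_answer.

# Illustrative: a test set whose reference answers live in a custom column.
import csv

rows = [
    {"question": "What is the capital of France?", "reference_answer": "Paris"},
    {"question": "What is 2 + 2?", "reference_answer": "4"},
]

with open("my_testset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "reference_answer"])
    writer.writeheader()
    writer.writerows(rows)

After uploading this CSV as a test set, open the evaluator's Advanced Settings and set the Expected Answer Column to reference_answer.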

In addition to this:

  • We've upgraded the SDK to pydantic v2.
  • We have improved the speed of the get config endpoint by 10x.
  • We have added documentation for observability.

Playground Improvements

v0.14.1-13

  • We've improved the workflow for adding outputs to a dataset in the playground. In the past, you had to select the name of the test set each time. Now, the last used test set is selected by default.
  • We have significantly improved the debugging experience when creating applications from code. Now, if an application fails, you can view the logs to understand the reason behind the failure.
  • We moved the copy message button in the playground to the output text area.
  • We now hide the cost and usage in the playground when they aren't specified
  • We've made improvements to error messages in the playground

Bug Fixes

  • Fixed the order of the arguments when running a custom code evaluator
  • Fixed the timestamps in the Testset view (previously, timestamps were dropping the trailing zero)
  • Fixed the creation of application from code in the self-hosted version when using Windows

Prompt and Configuration Registry

We've introduced a feature that allows you to use Agenta as a prompt registry or prompt management system. In the deployment view, we now provide an endpoint to directly fetch the latest version of your prompt. Here is what it looks like:


from agenta import Agenta

agenta = Agenta()
# Fetch the configuration deployed to the production environment,
# caching the result for 200 seconds
config = agenta.get_config(base_id="xxxxx", environment="production", cache_timeout=200)

You can find additional documentation here.

Improvements

  • Previously, publishing a variant from the playground to an environment was a manual process. From now on, variants are published to the production environment by default.

Miscellaneous Improvements

  • The total cost of an evaluation is now displayed in the evaluation table. This allows you to understand how much evaluations are costing you and track your expenses.

Bug Fixes

  • Fixed sidebar focus in automatic evaluation results view
  • Fixed the incorrect URLs shown when running agenta variant serve

Evaluation Speed Increase and Numerous Quality of Life Improvements

v0.13.1-5

  • We've improved the speed of evaluations by 3x through the use of asynchronous batching of calls (see the sketch after this list).
  • We've added Groq as a new provider along with Llama3 to our playground.
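
For readers curious about the technique, the sketch below shows the general pattern of bounded asynchronous batching: all calls are scheduled concurrently, with a semaphore capping how many are in flight at once. It is an illustration of the approach, not agenta's actual evaluation code; call_llm stands in for whatever async function performs a single call.

import asyncio

async def evaluate_all(rows, call_llm, max_concurrency=10):
    # Cap the number of in-flight calls so provider rate limits are respected
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(row):
        async with semaphore:
            return await call_llm(row)

    # Schedule every row at once; the semaphore throttles actual execution
    return await asyncio.gather(*(run_one(row) for row in rows))

# Hypothetical usage: results = asyncio.run(evaluate_all(testset_rows, call_llm))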

Bug Fixes

  • Resolved a rendering UI bug in Testset view.
  • Fixed incorrect URLs displayed when running the 'agenta variant serve' command.
  • Corrected timestamps in the configuration.
  • Resolved errors when using the chat template with empty input.
  • Fixed latency format in evaluation view.
  • Added a spinner to the Human Evaluation results table.
  • Resolved an issue where the gitignore was being overwritten when running 'agenta init'.

Observability (beta)

You can now monitor your application usage in production. We've added a new observability feature (currently in beta), which allows you to:

  • Monitor cost, latency, and the number of calls to your applications in real-time.
  • View the logs of your LLM calls, including inputs, outputs, and the configurations used. You can also add any interesting logs to your test set.
  • Trace your more complex LLM applications to understand their internal logic and debug them.

As of now, all new applications created will include observability by default. We are working towards a GA version in the coming weeks, which will be scalable and better integrated with your applications. We will also be adding tutorials and documentation about it.

Find examples of LLM apps created from code with observability here.