You can now create projects within an organization. This feature helps you organize your work when you're building multiple AI products or managing different teams working on separate initiatives.
You can create a new project directly from the sidebar in the Agenta interface. Once created, you can switch between projects using the sidebar navigation.
Each team member can work in different projects simultaneously. The interface remembers your last active project, making it easy to pick up where you left off.
If you're managing complex AI initiatives across multiple products, projects give you the structure to keep everything organized. You can create your first project from the sidebar and start organizing your prompts and evaluations.
For questions about projects or organizational structure, check the FAQ or reach out through our support channels.
You can now create projects within an organization. This lets you divide your work between different AI products. Each project scopes its prompts, traces, and evaluations. Create a new project or navigate between projects directly from the sidebar.
You can now configure reasoning effort for models that support this parameter, such as OpenAI's o1 series and Google's Gemini 2.5 Pro. The reasoning effort setting is part of your prompt template, making it available when you fetch prompts via the SDK or invoke them through Agenta as an LLM gateway.
We're open sourcing the core of Agenta under the MIT license. All functional features are now available to the community. This includes the evaluation system, prompt playground and management, observability, and all core workflows.
Development moves back to the public repository. We're building in public again. Only enterprise collaboration features like RBAC, SSO, and audit logs remain under a separate license.
You can now run programmatic evaluations of complex AI agents and workflows directly from code. The Evaluation SDK gives you full control over test data and evaluation logic. It works with agents built using any framework.
The SDK lets you create test sets in code or fetch them from Agenta. You can use built-in evaluators like LLM-as-a-Judge, semantic similarity, or regex matching. You can also write custom Python evaluators. The SDK evaluates end-to-end workflows or specific spans in execution traces. Evaluations run on your own infrastructure; results display in the Agenta dashboard.
You can now automatically evaluate every request to your LLM application in production. Online Evaluation helps you catch hallucinations and off-brand responses as they happen. You no longer need to discover problems through user complaints.
You can configure evaluators like LLM-as-a-Judge with custom prompts. Set sampling rates to control costs. Create evaluations with filters for specific spans in your traces. All evaluated requests appear in one dashboard. You can filter traces by evaluation scores to understand issues. You can also add problematic cases to test sets for continuous improvement.
Setting up online evaluation takes just a couple of minutes. It provides immediate visibility into production quality.
The LLM-as-a-Judge evaluator now supports custom output schemas. Create multiple feedback outputs per evaluator with any structure you need.
You can configure output types (binary, multiclass), include reasoning to improve prediction quality, or provide a raw JSON schema with any structure you define. Use these custom schemas in your evaluations to capture exactly the feedback you need.
We've completely rewritten and restructured our documentation with a new architecture. This is one of the largest updates we've made, involving a near-complete rewrite of existing content.
Key improvements include:
Diataxis Framework: Organized content into Tutorials, How-to Guides, Reference, and Explanation sections for better discoverability
Expanded Observability Docs: Added missing documentation for tracing, annotations, and observability features
We've added support for Google Cloud's Vertex AI platform. You can now use Gemini models and other Vertex AI partner models in the playground, configure them in the Model Hub, and access them through the gateway's invoke endpoints.
You can now filter and search traces based on their annotations. This helps you find traces with low scores or bad feedback quickly.
We rebuilt the filtering system in observability with a simpler dropdown and more options. You can now filter by span status, input keys, app or environment references, and any key within your span.
The new annotation filtering lets you find:
Spans evaluated by a specific evaluator
Spans with user feedback like success=True
This enables powerful workflows: capture user feedback from your app, filter to find traces with bad feedback, add them to test sets, and improve your prompts based on real user data.
We've completely redesigned the evaluation results dashboard. You can now analyze your evaluation results more easily and understand performance across different metrics.
Here's what's new:
Metrics plots: We've added plots for all the evaluator metrics. You can now see the distribution of the results and easily spot outliers.
Side-by-side comparison: You can now compare multiple evaluations simultaneously, both their plots and their individual outputs.
Improved test cases view: Results are now displayed in a tabular format that works for both small and large datasets.
Focused detail view: A new focused drawer lets you examine individual data points in more detail. This is especially helpful with large datasets.
Configuration view: See exactly which configurations were used in each evaluation.
Evaluation Run naming and descriptions: Add names and descriptions to your evaluation runs to organize things better.
URLs across Agenta now include workspace context, making them fully shareable between team members. Previously, URLs would always point to the default workspace, causing issues when refreshing pages or sharing links.
Now you can deep link to almost anything in the platform - prompts, evaluations, and more - in any workspace. Share links directly with team members and they'll see exactly what you intended, regardless of their default workspace settings.
We rewrote most of Agenta's frontend. Creating prompts and working in the playground is now much faster.
We also made many improvements and fixed bugs:
Improvements:
LLM-as-a-judge prompts now use double curly braces ({{variable}}) instead of single curly braces ({variable}), matching how normal prompts work. Old LLM-as-a-judge prompts with single curly braces still work, and we've updated the LLM-as-a-judge playground to make editing prompts easier.
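For example, a judge prompt using the new syntax might look like the sketch below; the variable names are placeholders and should match the columns of your test set and your app's output:

```
You are an impartial judge. Compare the model's answer to the expected answer.

Question: {{question}}
Model answer: {{prediction}}
Expected answer: {{correct_answer}}

Reply with "correct" or "incorrect" and a one-sentence justification.
```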
We rebuilt the human evaluation workflow from scratch. Now you can set multiple evaluators and metrics and use them to score the outputs.
This lets you evaluate the same output on different metrics like relevance or completeness. You can use binary or numerical scores, or even free-text strings for comments or expected answers.
Watch the video below and read the post for more details. Or check out the docs to learn how to use the new human evaluation workflow.
We've made our product roadmap completely transparent and community-driven.
You can now see exactly what we're building, what's shipped, and what's coming next. Plus vote on features that matter most to you.
Why we're doing this: We believe open-source startups succeed when they create the most value possible, and the best way to do that is by building with our community, not in isolation. Up until now, we've kept our roadmap private, but that meant losing something important: your feedback and the ability to let you shape our direction. Today we're open-sourcing our roadmap because we want to build a community of owners, not just passive users.
We've made significant improvements across Agenta with a major documentation overhaul, new model support, self-hosting enhancements, and UI improvements.
Revamped Prompt Engineering Documentation:
We've completely rewritten our prompt management and prompt engineering documentation.
Start exploring the new documentation in our updated Quick Start Guide.
New Model Support:
Our platform now supports several new LLM models:
Google's Gemini 2.5 Pro and Flash
Alibaba Cloud's Qwen 3
OpenAI's GPT-4.1
These models are available in both the playground and through the API.
Playground Enhancements:
We've added a draft state to the playground, providing a better editing experience. Changes are now clearly marked as drafts until committed.
Self-Hosting Improvements:
We've significantly simplified the self-hosting experience by changing how environment variables are handled in the frontend:
No more rebuilding images to change ports or domains
Dynamic configuration through environment variables at runtime
We are SOC 2 Type 2 Certified. This means that our platform is audited and certified by an independent third party to meet the highest standards of security and compliance.
We've introduced the Prompt and Deployment Registry, giving you a centralized place to manage all variants and versions of your prompts and deployments.
Key capabilities:
View all variants and revisions in a single table
Access all commits made to a variant
Use older versions of variants directly in the playground
We've made several improvements to the playground, including:
Improved scrolling behavior
Increased discoverability of variant creation and comparison
Implemented stop functionality in the playground
Custom workflows now support sub-routes. This means you can define multiple routes in one file and create multiple custom workflows from the same file, as shown in the sketch below.
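Here's a minimal sketch of what that can look like, based on the SDK example shown elsewhere in these notes; the route paths and the `call_llm` helper are placeholders, not part of the SDK:

```python
import agenta as ag
from pydantic import BaseModel, Field

ag.init()

def call_llm(prompt: str) -> str:
    # Placeholder for your own model call (OpenAI, LiteLLM, etc.)
    return f"(model output for: {prompt})"

class SummarizeConfig(BaseModel):
    prompt_template: str = Field(default="Summarize the following text: {text}")

class TranslateConfig(BaseModel):
    prompt_template: str = Field(default="Translate the following text to French: {text}")

# Two sub-routes defined in the same file, each exposed as its own custom workflow
@ag.route("/summarize", config_schema=SummarizeConfig)
def summarize(text: str) -> str:
    config = ag.ConfigManager.get_from_route(schema=SummarizeConfig)
    return call_llm(config.prompt_template.format(text=text))

@ag.route("/translate", config_schema=TranslateConfig)
def translate(text: str) -> str:
    config = ag.ConfigManager.get_from_route(schema=TranslateConfig)
    return call_llm(config.prompt_template.format(text=text))
```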
We've introduced major improvements to Agenta, focusing on OpenTelemetry compliance and simplified custom workflow debugging.
OpenTelemetry (OTel) Support:
Agenta is now fully OpenTelemetry-compliant. This means you can seamlessly integrate Agenta with thousands of OTel-compatible services using existing SDKs. To integrate your application with Agenta, simply configure an OTel exporter pointing to your Agenta endpoint—no additional setup required.
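For example, using the standard OpenTelemetry Python SDK, the wiring could look roughly like this; the ingestion URL and header format below are assumptions, so check the observability docs for the values that apply to your deployment:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# The endpoint and auth header below are illustrative assumptions --
# use the OTLP URL and API-key header documented for your Agenta instance.
exporter = OTLPSpanExporter(
    endpoint="https://cloud.agenta.ai/api/otlp/v1/traces",
    headers={"Authorization": "ApiKey YOUR_AGENTA_API_KEY"},
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("my-llm-call"):
    ...  # your application code runs here and is exported to Agenta
```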
We've enhanced distributed tracing capabilities to better debug complex distributed agent systems. All HTTP interactions between agents—whether running within Agenta's SDK or externally—are automatically traced, making troubleshooting and monitoring easier.
Based on your feedback, we've streamlined debugging and running custom workflows:
Run workflows from your environments: You no longer need the Agenta CLI to manage custom workflows. Setting up custom workflows now involves simply adding the Agenta SDK to your code, creating an endpoint, and connecting it to Agenta via the web UI. You can check how it's done in the quick start guide.
Custom Workflows in the new playground: Custom workflows are now fully compatible with the new playground. You can now nest configurations, run side-by-side comparisons, and debug your agents and complex workflows very easily.
We've rebuilt our playground from scratch to make prompt engineering faster and more intuitive. The old playground took 20 seconds to create a prompt - now it's instant.
Key improvements:
Create prompts with multiple messages using our new template system
Format variables easily with curly bracket syntax and a built-in validator
Switch between chat and completion prompts in one interface
Load test sets directly in the playground to iterate faster
Save successful outputs as test cases with one click
Compare different prompts side-by-side
Deploy changes straight to production
For developers, you can now create prompts programmatically through our API.
You can explore these features in our updated playground documentation.
We've achieved SOC 2 Type 1 certification, validating our security controls for protecting sensitive LLM development data. This certification covers our entire platform, including prompt management, evaluation frameworks, and observability tools.
Key security features and improvements:
Data encryption in transit and at rest
Enhanced access control and authentication
Comprehensive security monitoring
Regular third-party security assessments
Backup and disaster recovery protocols
This certification represents a significant milestone for teams using Agenta in production environments. Whether you're using our open-source platform or cloud offering, you can now build LLM applications with enterprise-grade security confidence.
We've also updated our trust center with detailed information about our security practices and compliance standards. For teams interested in learning more about our security controls or requesting our SOC 2 report, please contact [email protected].
This release introduces the ability to add spans to test sets, making it easier to bootstrap your evaluation data from production. The new feature lets you:
Add individual or batch spans to test sets
Create custom mappings between spans and test sets
Preview test set changes before committing them
Additional improvements:
Fixed CSV test set upload issues
Prevented viewing of incomplete evaluations
Added mobile compatibility warning
Added support for custom ports in self-hosted installations
You can now see traces directly in the playground. For simple applications, this means you can view the prompts sent to LLMs. For custom workflows, you get an overview of intermediate steps and outputs. This makes it easier to understand what’s happening under the hood and debug your applications.
We’ve strengthened authentication for deployed applications. As you know, Agenta lets you either fetch the app’s config or call it with Agenta acting as a proxy. Now, we’ve added authentication to the second method. The APIs we create are now protected and can be called using an API key. You can find code snippets for calling the application in the overview page.
We’ll publish a full blog post soon, but here’s a quick look at what the new observability offers:
A redesigned UI that lets you visualize nested traces, making it easier to understand what’s happening behind the scenes.
The web UI lets you filter traces by name, cost, and other attributes—you can even search through them easily.
The SDK is OTel-compatible, and we've already tested integrations for OpenAI, LangChain, LiteLLM, and Instructor, with guides available for each. In most cases, adding a few lines of code is enough to see traces directly in Agenta.
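As a rough sketch of what the OpenAI integration looks like (the instrumentation package and environment variables below are assumptions; the integration guides have the exact steps):

```python
import agenta as ag
from openai import OpenAI
# OpenAIInstrumentor comes from the OpenTelemetry/OpenLLMetry instrumentation
# package for OpenAI (assumed dependency: opentelemetry-instrumentation-openai).
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

ag.init()  # assumes AGENTA_API_KEY (and host, if self-hosted) are set in the environment
OpenAIInstrumentor().instrument()  # OpenAI calls are now traced and exported to Agenta

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a haiku about tracing."}],
)
print(response.choices[0].message.content)
```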
Next: Prompt Management
We’ve completely rewritten the prompt management SDK, giving you full CRUD capabilities for prompts and configurations. This includes creating, updating, reading history, deploying new versions, and deleting old ones. You can find a first tutorial for this here.
And finally: LLM-as-a-Judge Overhaul
We've made significant upgrades to the LLM-as-a-Judge evaluator. It now supports prompts with multiple messages and has access to all variables in a test case. You can also switch models (currently supporting OpenAI and Anthropic). These changes make the evaluator much more flexible, and we're seeing better results with it.
We updated the Application Management View to improve the UI. Many users struggled to find their applications when they had a large number, so we've improved the view and added a search bar for quick filtering.
Additionally, we are moving towards a new project structure for the application. We moved test sets and evaluators outside of the application scope. So now, you can use the same test set and evaluators in multiple applications.
Bug Fixes
Added an export button in the evaluation view to export results from the main view.
Eliminated Pydantic warnings in the CLI.
Improved error messages when fetch_config is called with wrong arguments.
Enhanced the custom code evaluation sandbox and removed the limitation that results need to be between 0 and 1.
Many users faced challenges configuring evaluators in the web UI. Some evaluators, such as LLM-as-a-Judge, custom code, or RAG evaluators, can be tricky to set up correctly on the first try. Until now, users had to set up an evaluator, run an evaluation, check the errors, and then start over.
To address this, we've introduced a new evaluator test/debug playground. This feature allows you to try the evaluator live on real data, so you can validate the configuration before committing to it and using it in evaluations.
Additionally, we have improved and redesigned the evaluation view. Both automatic and human evaluations are now within the same view but in different tabs. We're moving towards unifying all evaluator results and consolidating them in one view, allowing you to quickly get an overview of what's working.
We've completely redesigned the platform's UI. Additionally we have introduced a new overview view for your applications. This is part of a series of upcoming improvements slated for the next few weeks.
The new overview view offers:
A dashboard displaying key metrics of your application
A table with all the variants of your applications
A summary of your application's most recent evaluations
We've also added a new JSON Diff evaluator. This evaluator compares two JSON objects and provides a similarity score.
Lastly, we've updated the UI of our documentation.
We've released a new version of the SDK for creating custom applications. This Pydantic-based SDK significantly simplifies the process of building custom applications. It's fully backward compatible, so your existing code will continue to work seamlessly. We'll soon be rolling out comprehensive documentation and examples for the new SDK.
In the meantime, here's a quick example of how to use it:
```python
import agenta as ag
from openai import OpenAI
from pydantic import BaseModel, Field

ag.init()
client = OpenAI()

# Define the configuration of the application (shown in the playground)
class MyConfig(BaseModel):
    temperature: float = Field(default=0.2)
    prompt_template: str = Field(default="What is the capital of {country}?")

# Create an endpoint for the entrypoint of the application
@ag.route("/", config_schema=MyConfig)
def generate(country: str) -> str:
    # Fetch the config from the request
    config: MyConfig = ag.ConfigManager.get_from_route(schema=MyConfig)
    prompt = config.prompt_template.format(country=country)
    chat_completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=config.temperature,
    )
    return chat_completion.choices[0].message.content
```
We're excited to announce two major features this week:
We've integrated RAGAS evaluators into agenta. Two new evaluators have been added: RAG Faithfulness (measuring how consistent the LLM output is with the context) and Context Relevancy (assessing how relevant the retrieved context is to the question). Both evaluators use intermediate outputs within the trace to calculate the final score.
You can now view traces directly in the playground. This feature enables you to debug your application while configuring it—for example, by examining the prompts sent to the LLM or reviewing intermediate outputs.
Note: Both features are available exclusively in the cloud and enterprise versions of Agenta.
Evaluators can now access all columns in the test set. Previously, you were limited to using only the correct_answer column as the ground truth / reference answer in evaluations.
Now you can configure your evaluator to use any column in the test set as the ground truth. To do that, open the collapsible Advanced Settings when configuring the evaluator and set the Expected Answer Column to the name of the column containing the reference answer you want to use.
In addition to this:
We've upgraded the SDK to pydantic v2.
We've made the get config endpoint 10x faster.
We've improved the workflow for adding outputs to a test set in the playground. In the past, you had to select the name of the test set each time. Now, the last used test set is selected by default.
We have significantly improved the debugging experience when creating applications from code. Now, if an application fails, you can view the logs to understand the reason behind the failure.
We moved the copy message button in the playground to the output text area.
We now hide the cost and usage in the playground when they aren't specified.
We've made improvements to error messages in the playground.
Bug Fixes
Fixed the order of the arguments when running a custom code evaluator
Fixed the timestamps in the Testset view (previously, timestamps dropped the trailing zero)
Fixed the creation of application from code in the self-hosted version when using Windows
We've introduced a feature that allows you to use Agenta as a prompt registry or management system. In the deployment view, we now provide an endpoint to directly fetch the latest version of your prompt. Here is how it looks:
```python
from agenta import Agenta

agenta = Agenta()

# Fetches the configuration with caching
config = agenta.get_config(
    base_id="xxxxx",
    environment="production",
    cache_timeout=200,
)
```
Previously, publishing a variant from the playground to an environment was a manual process. From now on, variants are published to the production environment by default.
The total cost of an evaluation is now displayed in the evaluation table. This allows you to understand how much evaluations are costing you and track your expenses.
Bug Fixes
Fixed sidebar focus in automatic evaluation results view
Fixed the incorrect URLs shown when running agenta variant serve
You can now monitor your application usage in production. We've added a new observability feature (currently in beta), which allows you to:
Monitor cost, latency, and the number of calls to your applications in real-time.
View the logs of your LLM calls, including inputs, outputs, and used configurations. You can also add any interesting logs to your test set.
Trace your more complex LLM applications to understand the logic within and debug it.
As of now, all new applications created will include observability by default. We are working towards a GA version in the coming weeks, which will be scalable and better integrated with your applications. We will also be adding tutorials and documentation about it.
Find examples of LLM apps created from code with observability here.
We've introduced prompt versioning, allowing you to track changes made by the team and revert to previous versions. To view the change history of a configuration, click the history icon in the playground to access all previous versions.
v0.9.1
We have added a new evaluator to match JSON fields and added the possibility to use columns in the test set other than correct_answer as the ground truth.
Up until now, we required users to use our OpenAI API key when using the cloud version. Starting now, you can use your own API key for any new application you create.
We've spent the past month re-engineering our evaluation workflow. Here's what's new:
Running Evaluations
Simultaneous Evaluations: You can now run multiple evaluations for different app variants and evaluators concurrently.
Rate Limit Parameters: Specify these during evaluations and reattempts to ensure reliable results without exceeding OpenAI rate limits.
Reusable Evaluators: Configure evaluators such as similarity match, regex match, or AI critique and use them across multiple evaluations.
Evaluation Reports
Dashboard Improvements: We've upgraded our dashboard interface to better display evaluation results. You can now filter and sort results by evaluator, test set, and outcomes.
Comparative Analysis: Select multiple evaluation runs and view the results of various LLM applications side-by-side.
This necessitated modifications to the SDK. The LLM application API now returns a JSON object instead of a string. The JSON includes the output message, usage details, and cost:
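For illustration, a response now looks roughly like this (the field names below are indicative, not the exact schema):

```json
{
  "message": "The capital of France is Paris.",
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "total_tokens": 21
  },
  "cost": 0.000034
}
```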
You can now configure reasoning effort for models that support this parameter, such as OpenAI's o1 series and Google's Gemini 2.5 Pro.
Reasoning effort controls how much computational thinking the model applies before generating a response. This is particularly useful for complex reasoning tasks where you want to balance response quality with latency and cost.
The reasoning effort parameter is part of your prompt template configuration. When you fetch prompts via the SDK or invoke them through Agenta as an LLM gateway, the reasoning effort setting is included in the configuration and applied to your requests automatically.
This gives you fine-grained control over model behavior directly from the playground, making it easier to optimize for your specific use case.
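As a sketch, a fetched prompt configuration with reasoning effort could look something like the following; the field names here are illustrative assumptions rather than the authoritative schema:

```python
# Illustrative shape of a prompt configuration fetched via the SDK or gateway.
config = {
    "messages": [
        {"role": "system", "content": "You are a careful analyst."},
        {"role": "user", "content": "{{question}}"},
    ],
    "llm_config": {
        "model": "o1",                # a reasoning-capable model
        "reasoning_effort": "high",   # assumed field name; typically low / medium / high
        "max_tokens": 2048,
    },
}
```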
We're excited to announce a powerful update to the Agenta playground. You can now use Jinja2 templating in your prompts.
This means you can add sophisticated logic directly into your prompt templates. Use conditional statements, apply filters to variables, and transform data on the fly.
The template_format field tells Agenta how to process your variables. This works both when invoking prompts through Agenta as an LLM gateway and when fetching prompts programmatically via the SDK.
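For illustration, here is the kind of template Jinja2 support enables, rendered locally with the jinja2 package (the variables and filters are examples, not a fixed schema):

```python
from jinja2 import Template

prompt = Template(
    "You are a support assistant.\n"
    "{% if tier == 'enterprise' %}Prioritize SLA-related questions.{% endif %}\n"
    "Customer: {{ name | title }}\n"
    "Question: {{ question }}"
)

print(prompt.render(
    tier="enterprise",
    name="acme corp",
    question="How do I rotate my API key?",
))
```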
Every feature you need to build, test, and deploy LLM applications is now open source. This includes the evaluation system, prompt playground and management, observability, and all core workflows.
You can run evaluations using LLM-as-a-Judge, custom code evaluators, or any built-in evaluator. Create and manage test sets. Evaluate end-to-end workflows or specific spans in traces.
Experiment with prompts in the playground. Version and commit changes. Deploy to environments. Fetch configurations programmatically.
Trace your LLM applications with OpenTelemetry support. View detailed execution traces. Monitor costs and performance. Filter and search traces.
Only enterprise collaboration features stay under a separate license. This includes role-based access control (RBAC), single sign-on (SSO), and audit logs. These features support teams with specific compliance and security requirements.
You can run Agenta on your infrastructure with full access to evaluation, prompting, and observability features. You can modify the code to fit your needs. You can contribute back to the project.
The MIT license gives you freedom to use, modify, and distribute Agenta. We believe open source creates better products through community collaboration.
The Evaluation SDK lets you run evaluations programmatically from code. You get full control over test data and evaluation logic. You can evaluate agents built with any framework and view results in the Agenta dashboard.
Complex AI agents need evaluation that goes beyond UI-based testing. The Evaluation SDK provides code-level control over test data and evaluation logic. You can test agents built with any framework. Run evaluations in your CI/CD pipeline. Debug complex workflows with full trace visibility.
Create test sets directly in your code or fetch existing ones from Agenta. Test sets can include ground truth data for reference-based evaluation or work without it for evaluators that only need the output.
The SDK includes LLM-as-a-Judge, semantic similarity, and regex matching evaluators. You can also write custom Python evaluators for your specific requirements.
Evaluate your agent end to end or test specific spans in the execution trace. Test individual components like retrieval steps or tool calls separately.
Here's a minimal example evaluating a simple agent:
```python
import agenta as ag
from agenta.sdk.evaluations import aevaluate

# Initialize
ag.init()

# Define your application
@ag.application(slug="my_agent")
async def my_agent(question: str):
    # Your agent logic here
    return answer

# Define an evaluator
@ag.evaluator(slug="correctness_check")
async def correctness_check(expected: str, outputs: str):
    return {
        "score": 1.0 if outputs == expected else 0.0,
        "success": outputs == expected,
    }

# Create test data
testset = await ag.testsets.acreate(
    name="Agent Tests",
    data=[
        {"question": "What is 2+2?", "expected": "4"},
        {"question": "What is the capital of France?", "expected": "Paris"},
    ],
)

# Run evaluation
result = await aevaluate(
    name="Agent Correctness Test",
    testsets=[testset.id],
    applications=[my_agent],
    evaluators=[correctness_check],
)

print(f"View results: {result['dashboard_url']}")
```
Every evaluation run gets a shareable dashboard link. The dashboard shows full execution traces, comparison views for different versions, aggregated metrics, and individual test case details.
Online Evaluation automatically evaluates every request to your LLM application in production. Catch quality issues like hallucinations and off-brand responses as they happen.
Online Evaluation runs evaluators on your production traces automatically. Monitor quality in real time instead of discovering issues through user complaints.
Create online evaluations with filters for specific spans in your traces. Evaluate just the retrieval step in your RAG pipeline or focus on specific tool calls in your agent.
Set sampling rates to control costs. Evaluate every request during testing, then sample a percentage in production to balance quality monitoring with budget.
View all evaluated requests in one place. Filter traces by evaluation scores to find problematic cases. Jump into detailed traces to understand what went wrong.
Catch hallucinations by running fact-checking evaluators on every response. Monitor brand compliance using LLM-as-a-Judge evaluators with custom prompts. Track RAG quality by evaluating retrieval in real time. Monitor agent reliability by checking tool calls and reasoning steps. Build better test sets by capturing edge cases from production.
Enable the reasoning option to have the LLM explain its evaluation. This improves prediction quality because the model thinks through its assessment before providing a score.
When you include reasoning, the evaluator returns both the score and a detailed explanation of how it arrived at that judgment.
For complete control, provide a raw JSON schema. The evaluator will return responses that match your exact structure.
This lets you capture multiple scores, categorical labels, confidence levels, and custom fields in a single evaluation pass. You can structure the output however your workflow requires.
Once configured, your custom schemas work seamlessly in the evaluation workflow. The results display in the evaluation dashboard with all your custom fields visible.
This makes it easy to analyze multiple dimensions of quality in a single evaluation run.
Binary Score with Reasoning:
Return a simple correct/incorrect judgment along with an explanation of why the output succeeded or failed.
Multi-dimensional Feedback:
Capture separate scores for accuracy, relevance, completeness, and tone in one evaluation. Include reasoning for each dimension.
Structured Classification:
Return categorical labels (excellent/good/fair/poor) along with specific issues found and suggestions for improvement.
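As an illustrative sketch (not the exact format Agenta expects), a raw JSON schema for multi-dimensional feedback could look like this:

```json
{
  "type": "object",
  "properties": {
    "accuracy": { "type": "number", "minimum": 0, "maximum": 1 },
    "relevance": { "type": "number", "minimum": 0, "maximum": 1 },
    "tone": { "type": "string", "enum": ["excellent", "good", "fair", "poor"] },
    "reasoning": { "type": "string" }
  },
  "required": ["accuracy", "relevance", "tone"]
}
```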
We've completely rewritten and restructured our documentation with a new architecture. This is one of the largest updates we've made to the documentation, involving a near-complete rewrite of existing content and adding substantial new material.
Documentation now includes JavaScript and TypeScript examples alongside Python wherever applicable. This makes it easier for JavaScript developers to integrate Agenta into their applications.
We've added a new "Ask AI" feature that lets you ask questions directly to the documentation. Get instant answers to your questions without searching through pages.