You can now create projects within an organization. This feature helps you organize your work when you're building multiple AI products or managing different teams working on separate initiatives.
You can create a new project directly from the sidebar in the Agenta interface. Once created, you can switch between projects using the sidebar navigation.
Each team member can work in different projects simultaneously. The interface remembers your last active project, making it easy to pick up where you left off.
If you're managing complex AI initiatives across multiple products, projects give you the structure to keep everything organized. You can create your first project from the sidebar and start organizing your prompts and evaluations.
For questions about projects or organizational structure, check the FAQ or reach out through our support channels.
You can now create projects within an organization. This lets you divide your work between different AI products. Each project scopes its prompts, traces, and evaluations. Create a new project or navigate between projects directly from the sidebar.
You can now configure reasoning effort for models that support this parameter, such as OpenAI's o1 series and Google's Gemini 2.5 Pro. The reasoning effort setting is part of your prompt template, making it available when you fetch prompts via the SDK or invoke them through Agenta as an LLM gateway.
We're open sourcing the core of Agenta under the MIT license. All functional features are now available to the community. This includes the evaluation system, prompt playground and management, observability, and all core workflows.
Development moves back to the public repository. We're building in public again. Only enterprise collaboration features like RBAC, SSO, and audit logs remain under a separate license.
You can now run programmatic evaluations of complex AI agents and workflows directly from code. The Evaluation SDK gives you full control over test data and evaluation logic. It works with agents built using any framework.
The SDK lets you create test sets in code or fetch them from Agenta. You can use built-in evaluators like LLM-as-a-Judge, semantic similarity, or regex matching. You can also write custom Python evaluators. The SDK evaluates end-to-end workflows or specific spans in execution traces. Evaluations run on your own infrastructure; results display in the Agenta dashboard.
You can now automatically evaluate every request to your LLM application in production. Online Evaluation helps you catch hallucinations and off-brand responses as they happen. You no longer need to discover problems through user complaints.
You can configure evaluators like LLM-as-a-Judge with custom prompts. Set sampling rates to control costs. Create evaluations with filters for specific spans in your traces. All evaluated requests appear in one dashboard. You can filter traces by evaluation scores to understand issues. You can also add problematic cases to test sets for continuous improvement.
Setting up online evaluation takes just a couple of minutes. It provides immediate visibility into production quality.
The LLM-as-a-Judge evaluator now supports custom output schemas. Create multiple feedback outputs per evaluator with any structure you need.
You can configure output types (binary, multiclass), include reasoning to improve prediction quality, or provide a raw JSON schema with any structure you define. Use these custom schemas in your evaluations to capture exactly the feedback you need.
We've completely rewritten and restructured our documentation with a new architecture. This is one of the largest updates we've made, involving a near-complete rewrite of existing content.
Key improvements include:
Diataxis Framework: Organized content into Tutorials, How-to Guides, Reference, and Explanation sections for better discoverability
Expanded Observability Docs: Added missing documentation for tracing, annotations, and observability features
We've added support for Google Cloud's Vertex AI platform. You can now use Gemini models and other Vertex AI partner models in the playground, configure them in the Model Hub, and access them through the gateway's invoke endpoints.
You can now filter and search traces based on their annotations. This helps you find traces with low scores or bad feedback quickly.
We rebuilt the filtering system in observability with a simpler dropdown and more options. You can now filter by span status, input keys, app or environment references, and any key within your span.
The new annotation filtering lets you find:
Spans evaluated by a specific evaluator
Spans with user feedback like success=True
This enables powerful workflows: capture user feedback from your app, filter to find traces with bad feedback, add them to test sets, and improve your prompts based on real user data.
We've completely redesigned the evaluation results dashboard. You can now analyze your evaluation results more easily and understand performance across different metrics.
Here's what's new:
Metrics plots: We've added plots for all the evaluator metrics. You can now see the distribution of the results and easily spot outliers.
Side-by-side comparison: You can now compare multiple evaluations simultaneously, both their plots and their individual outputs.
Improved test cases view: Results are now displayed in a tabular format that works for both small and large datasets.
Focused detail view: A new focused drawer lets you examine individual data points in more detail. This is especially helpful with large datasets.
Configuration view: See exactly which configurations were used in each evaluation.
Evaluation Run naming and descriptions: Add names and descriptions to your evaluation runs to organize things better.
URLs across Agenta now include workspace context, making them fully shareable between team members. Previously, URLs would always point to the default workspace, causing issues when refreshing pages or sharing links.
Now you can deep link to almost anything in the platform - prompts, evaluations, and more - in any workspace. Share links directly with team members and they'll see exactly what you intended, regardless of their default workspace settings.
We rewrote most of Agenta's frontend. Creating prompts and working in the playground is now much faster.
We also made many improvements and fixed bugs:
Improvements:
LLM-as-a-judge prompts now use double curly braces ({{variable}}) instead of single curly braces ({variable}), matching how normal prompts work. Old LLM-as-a-judge prompts with single curly braces still work, and we've updated the LLM-as-a-judge playground to make editing prompts easier.
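For example, a judge prompt using the new syntax might look like the sketch below; the variable names are placeholders and should match the columns of your test set and your app's output:

```
You are an impartial judge. Compare the model's answer to the expected answer.

Question: {{question}}
Model answer: {{prediction}}
Expected answer: {{correct_answer}}

Reply with "correct" or "incorrect" and a one-sentence justification.
```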
We rebuilt the human evaluation workflow from scratch. Now you can set multiple evaluators and metrics and use them to score the outputs.
This lets you evaluate the same output on different metrics like relevance or completeness. You can use binary or numerical scores, or even free-text strings for comments or expected answers.
Watch the video below and read the post for more details. Or check out the docs to learn how to use the new human evaluation workflow.
We've made our product roadmap completely transparent and community-driven.
You can now see exactly what we're building, what's shipped, and what's coming next. Plus vote on features that matter most to you.
Why we're doing this: We believe open-source startups succeed when they create the most value possible, and the best way to do that is by building with our community, not in isolation. Up until now, we've kept our roadmap private, but that meant losing something important: your feedback and the ability to let you shape our direction. Today we're open-sourcing our roadmap because we want to build a community of owners, not just passive users.
We've made significant improvements across Agenta with a major documentation overhaul, new model support, self-hosting enhancements, and UI improvements.
Revamped Prompt Engineering Documentation:
We've completely rewritten our prompt management and prompt engineering documentation.
Start exploring the new documentation in our updated Quick Start Guide.
New Model Support:
Our platform now supports several new LLM models:
Google's Gemini 2.5 Pro and Flash
Alibaba Cloud's Qwen 3
OpenAI's GPT-4.1
These models are available in both the playground and through the API.
Playground Enhancements:
We've added a draft state to the playground, providing a better editing experience. Changes are now clearly marked as drafts until committed.
Self-Hosting Improvements:
We've significantly simplified the self-hosting experience by changing how environment variables are handled in the frontend:
No more rebuilding images to change ports or domains
Dynamic configuration through environment variables at runtime
We are SOC 2 Type 2 Certified. This means that our platform is audited and certified by an independent third party to meet the highest standards of security and compliance.
We've introduced the Prompt and Deployment Registry, giving you a centralized place to manage all variants and versions of your prompts and deployments.
Key capabilities:
View all variants and revisions in a single table
Access all commits made to a variant
Use older versions of variants directly in the playground
We've made several improvements to the playground, including:
Improved scrolling behavior
Increased discoverability of variant creation and comparison
Implemented stop functionality in the playground
Custom workflows now support sub-routes. This means you can define multiple routes in one file and create multiple custom workflows from the same file, as shown in the sketch below.
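Here's a minimal sketch of what that can look like, based on the SDK example shown elsewhere in these notes; the route paths and the `call_llm` helper are placeholders, not part of the SDK:

```python
import agenta as ag
from pydantic import BaseModel, Field

ag.init()

def call_llm(prompt: str) -> str:
    # Placeholder for your own model call (OpenAI, LiteLLM, etc.)
    return f"(model output for: {prompt})"

class SummarizeConfig(BaseModel):
    prompt_template: str = Field(default="Summarize the following text: {text}")

class TranslateConfig(BaseModel):
    prompt_template: str = Field(default="Translate the following text to French: {text}")

# Two sub-routes defined in the same file, each exposed as its own custom workflow
@ag.route("/summarize", config_schema=SummarizeConfig)
def summarize(text: str) -> str:
    config = ag.ConfigManager.get_from_route(schema=SummarizeConfig)
    return call_llm(config.prompt_template.format(text=text))

@ag.route("/translate", config_schema=TranslateConfig)
def translate(text: str) -> str:
    config = ag.ConfigManager.get_from_route(schema=TranslateConfig)
    return call_llm(config.prompt_template.format(text=text))
```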
We've introduced major improvements to Agenta, focusing on OpenTelemetry compliance and simplified custom workflow debugging.
OpenTelemetry (OTel) Support:
Agenta is now fully OpenTelemetry-compliant. This means you can seamlessly integrate Agenta with thousands of OTel-compatible services using existing SDKs. To integrate your application with Agenta, simply configure an OTel exporter pointing to your Agenta endpoint—no additional setup required.
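For example, using the standard OpenTelemetry Python SDK, the wiring could look roughly like this; the ingestion URL and header format below are assumptions, so check the observability docs for the values that apply to your deployment:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# The endpoint and auth header below are illustrative assumptions --
# use the OTLP URL and API-key header documented for your Agenta instance.
exporter = OTLPSpanExporter(
    endpoint="https://cloud.agenta.ai/api/otlp/v1/traces",
    headers={"Authorization": "ApiKey YOUR_AGENTA_API_KEY"},
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("my-llm-call"):
    ...  # your application code runs here and is exported to Agenta
```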
We've enhanced distributed tracing capabilities to better debug complex distributed agent systems. All HTTP interactions between agents—whether running within Agenta's SDK or externally—are automatically traced, making troubleshooting and monitoring easier.
Based on your feedback, we've streamlined debugging and running custom workflows:
Run workflows from your environments: You no longer need the Agenta CLI to manage custom workflows. Setting up custom workflows now involves simply adding the Agenta SDK to your code, creating an endpoint, and connecting it to Agenta via the web UI. You can check how it's done in the quick start guide.
Custom Workflows in the new playground: Custom workflows are now fully compatible with the new playground. You can now nest configurations, run side-by-side comparisons, and debug your agents and complex workflows very easily.
We've rebuilt our playground from scratch to make prompt engineering faster and more intuitive. The old playground took 20 seconds to create a prompt - now it's instant.
Key improvements:
Create prompts with multiple messages using our new template system
Format variables easily with curly bracket syntax and a built-in validator
Switch between chat and completion prompts in one interface
Load test sets directly in the playground to iterate faster
Save successful outputs as test cases with one click
Compare different prompts side-by-side
Deploy changes straight to production
For developers, you can now create prompts programmatically through our API.
You can explore these features in our updated playground documentation.
We've achieved SOC 2 Type 1 certification, validating our security controls for protecting sensitive LLM development data. This certification covers our entire platform, including prompt management, evaluation frameworks, and observability tools.
Key security features and improvements:
Data encryption in transit and at rest
Enhanced access control and authentication
Comprehensive security monitoring
Regular third-party security assessments
Backup and disaster recovery protocols
This certification represents a significant milestone for teams using Agenta in production environments. Whether you're using our open-source platform or cloud offering, you can now build LLM applications with enterprise-grade security confidence.
We've also updated our trust center with detailed information about our security practices and compliance standards. For teams interested in learning more about our security controls or requesting our SOC 2 report, please contact [email protected].
This release introduces the ability to add spans to test sets, making it easier to bootstrap your evaluation data from production. The new feature lets you:
Add individual or batch spans to test sets
Create custom mappings between spans and test sets
Preview test set changes before committing them
Additional improvements:
Fixed CSV test set upload issues
Prevented viewing of incomplete evaluations
Added mobile compatibility warning
Added support for custom ports in self-hosted installations
You can now see traces directly in the playground. For simple applications, this means you can view the prompts sent to LLMs. For custom workflows, you get an overview of intermediate steps and outputs. This makes it easier to understand what’s happening under the hood and debug your applications.
We’ve strengthened authentication for deployed applications. As you know, Agenta lets you either fetch the app’s config or call it with Agenta acting as a proxy. Now, we’ve added authentication to the second method. The APIs we create are now protected and can be called using an API key. You can find code snippets for calling the application in the overview page.
We’ll publish a full blog post soon, but here’s a quick look at what the new observability offers:
A redesigned UI that lets you visualize nested traces, making it easier to understand what’s happening behind the scenes.
The web UI lets you filter traces by name, cost, and other attributes—you can even search through them easily.
The SDK is OTel-compatible, and we've already tested integrations for OpenAI, LangChain, LiteLLM, and Instructor, with guides available for each. In most cases, adding a few lines of code is enough to see traces directly in Agenta.
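As a rough sketch of what the OpenAI integration looks like (the instrumentation package and environment variables below are assumptions; the integration guides have the exact steps):

```python
import agenta as ag
from openai import OpenAI
# OpenAIInstrumentor comes from the OpenTelemetry/OpenLLMetry instrumentation
# package for OpenAI (assumed dependency: opentelemetry-instrumentation-openai).
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

ag.init()  # assumes AGENTA_API_KEY (and host, if self-hosted) are set in the environment
OpenAIInstrumentor().instrument()  # OpenAI calls are now traced and exported to Agenta

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a haiku about tracing."}],
)
print(response.choices[0].message.content)
```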
Next: Prompt Management
We’ve completely rewritten the prompt management SDK, giving you full CRUD capabilities for prompts and configurations. This includes creating, updating, reading history, deploying new versions, and deleting old ones. You can find a first tutorial for this here.
And finally: LLM-as-a-Judge Overhaul
We've made significant upgrades to the LLM-as-a-Judge evaluator. It now supports prompts with multiple messages and has access to all variables in a test case. You can also switch models (currently supporting OpenAI and Anthropic). These changes make the evaluator much more flexible, and we're seeing better results with it.
We updated the Application Management View to improve the UI. Many users struggled to find their applications when they had a large number, so we've improved the view and added a search bar for quick filtering.
Additionally, we are moving towards a new project structure for the application. We moved test sets and evaluators outside of the application scope. So now, you can use the same test set and evaluators in multiple applications.
Bug Fixes
Added an export button in the evaluation view to export results from the main view.
Eliminated Pydantic warnings in the CLI.
Improved error messages when fetch_config is called with wrong arguments.
Enhanced the custom code evaluation sandbox and removed the limitation that results need to be between 0 and 1.
Many users faced challenges configuring evaluators in the web UI. Some evaluators, such as LLM-as-a-Judge, custom code, or RAG evaluators, can be tricky to set up correctly on the first try. Until now, users had to set up an evaluator, run an evaluation, check the errors, and then start over.
To address this, we've introduced a new evaluator test/debug playground. This feature allows you to try the evaluator live on real data, so you can validate the configuration before committing to it and using it in evaluations.
Additionally, we have improved and redesigned the evaluation view. Both automatic and human evaluations are now within the same view but in different tabs. We're moving towards unifying all evaluator results and consolidating them in one view, allowing you to quickly get an overview of what's working.
We've completely redesigned the platform's UI. Additionally we have introduced a new overview view for your applications. This is part of a series of upcoming improvements slated for the next few weeks.
The new overview view offers:
A dashboard displaying key metrics of your application
A table with all the variants of your applications
A summary of your application's most recent evaluations
We've also added a new JSON Diff evaluator. This evaluator compares two JSON objects and provides a similarity score.
Lastly, we've updated the UI of our documentation.
We've released a new version of the SDK for creating custom applications. This Pydantic-based SDK significantly simplifies the process of building custom applications. It's fully backward compatible, so your existing code will continue to work seamlessly. We'll soon be rolling out comprehensive documentation and examples for the new SDK.
In the meantime, here's a quick example of how to use it:
```python
import agenta as ag
from openai import OpenAI
from pydantic import BaseModel, Field

ag.init()
client = OpenAI()

# Define the configuration of the application (shown in the playground)
class MyConfig(BaseModel):
    temperature: float = Field(default=0.2)
    prompt_template: str = Field(default="What is the capital of {country}?")

# Create an endpoint for the entrypoint of the application
@ag.route("/", config_schema=MyConfig)
def generate(country: str) -> str:
    # Fetch the config from the request
    config: MyConfig = ag.ConfigManager.get_from_route(schema=MyConfig)
    prompt = config.prompt_template.format(country=country)
    chat_completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=config.temperature,
    )
    return chat_completion.choices[0].message.content
```
We're excited to announce two major features this week:
We've integrated RAGAS evaluators into agenta. Two new evaluators have been added: RAG Faithfulness (measuring how consistent the LLM output is with the context) and Context Relevancy (assessing how relevant the retrieved context is to the question). Both evaluators use intermediate outputs within the trace to calculate the final score.
You can now view traces directly in the playground. This feature enables you to debug your application while configuring it—for example, by examining the prompts sent to the LLM or reviewing intermediate outputs.
Note: Both features are available exclusively in the cloud and enterprise versions of Agenta.
Evaluators can now access all columns in the test set. Previously, you were limited to using only the correct_answer column as the ground truth / reference answer in evaluations.
Now you can configure your evaluator to use any column in the test set as the ground truth. To do that, open the collapsible Advanced Settings when configuring the evaluator and set the Expected Answer Column to the name of the column containing the reference answer you want to use.
In addition to this:
We've upgraded the SDK to pydantic v2.
We've made the get config endpoint 10x faster.
We've improved the workflow for adding outputs to a test set in the playground. In the past, you had to select the name of the test set each time. Now, the last used test set is selected by default.
We have significantly improved the debugging experience when creating applications from code. Now, if an application fails, you can view the logs to understand the reason behind the failure.
We moved the copy message button in the playground to the output text area.
We now hide the cost and usage in the playground when they aren't specified.
We've made improvements to error messages in the playground.
Bug Fixes
Fixed the order of the arguments when running a custom code evaluator
Fixed the timestamps in the Testset view (previously, timestamps dropped the trailing zero)
Fixed the creation of application from code in the self-hosted version when using Windows
We've introduced a feature that allows you to use Agenta as a prompt registry or management system. In the deployment view, we now provide an endpoint to directly fetch the latest version of your prompt. Here is how it looks:
```python
from agenta import Agenta

agenta = Agenta()

# Fetches the configuration with caching
config = agenta.get_config(
    base_id="xxxxx",
    environment="production",
    cache_timeout=200,
)
```
Previously, publishing a variant from the playground to an environment was a manual process. From now on, variants are published to the production environment by default.
The total cost of an evaluation is now displayed in the evaluation table. This allows you to understand how much evaluations are costing you and track your expenses.
Bug Fixes
Fixed sidebar focus in automatic evaluation results view
Fixed the incorrect URLs shown when running agenta variant serve
You can now monitor your application usage in production. We've added a new observability feature (currently in beta), which allows you to:
Monitor cost, latency, and the number of calls to your applications in real-time.
View the logs of your LLM calls, including inputs, outputs, and used configurations. You can also add any interesting logs to your test set.
Trace your more complex LLM applications to understand the logic within and debug it.
As of now, all new applications created will include observability by default. We are working towards a GA version in the coming weeks, which will be scalable and better integrated with your applications. We will also be adding tutorials and documentation about it.
Find examples of LLM apps created from code with observability here.
We've introduced prompt versioning, allowing you to track changes made by the team and revert to previous versions. To view the change history of a configuration, click the history icon in the playground to access all previous versions.
v0.9.1
We have added a new evaluator to match JSON fields and added the possibility to use columns in the test set other than correct_answer as the ground truth.
Up until now, we required users to use our OpenAI API key when using the cloud version. Starting now, you can use your own API key for any new application you create.
We've spent the past month re-engineering our evaluation workflow. Here's what's new:
Running Evaluations
Simultaneous Evaluations: You can now run multiple evaluations for different app variants and evaluators concurrently.
Rate Limit Parameters: Specify these during evaluations and reattempts to ensure reliable results without exceeding OpenAI rate limits.
Reusable Evaluators: Configure evaluators such as similarity match, regex match, or AI critique and use them across multiple evaluations.
Evaluation Reports
Dashboard Improvements: We've upgraded our dashboard interface to better display evaluation results. You can now filter and sort results by evaluator, test set, and outcomes.
Comparative Analysis: Select multiple evaluation runs and view the results of various LLM applications side-by-side.
This necessitated modifications to the SDK. The LLM application API now returns a JSON object instead of a string. The JSON includes the output message, usage details, and cost:
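For illustration, a response now looks roughly like this (the field names below are indicative, not the exact schema):

```json
{
  "message": "The capital of France is Paris.",
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "total_tokens": 21
  },
  "cost": 0.000034
}
```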
You can now configure reasoning effort for models that support this parameter, such as OpenAI's o1 series and Google's Gemini 2.5 Pro.
Reasoning effort controls how much computational thinking the model applies before generating a response. This is particularly useful for complex reasoning tasks where you want to balance response quality with latency and cost.
The reasoning effort parameter is part of your prompt template configuration. When you fetch prompts via the SDK or invoke them through Agenta as an LLM gateway, the reasoning effort setting is included in the configuration and applied to your requests automatically.
This gives you fine-grained control over model behavior directly from the playground, making it easier to optimize for your specific use case.
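As a sketch, a fetched prompt configuration with reasoning effort could look something like the following; the field names here are illustrative assumptions rather than the authoritative schema:

```python
# Illustrative shape of a prompt configuration fetched via the SDK or gateway.
config = {
    "messages": [
        {"role": "system", "content": "You are a careful analyst."},
        {"role": "user", "content": "{{question}}"},
    ],
    "llm_config": {
        "model": "o1",                # a reasoning-capable model
        "reasoning_effort": "high",   # assumed field name; typically low / medium / high
        "max_tokens": 2048,
    },
}
```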
We're excited to announce a powerful update to the Agenta playground. You can now use Jinja2 templating in your prompts.
This means you can add sophisticated logic directly into your prompt templates. Use conditional statements, apply filters to variables, and transform data on the fly.
The template_format field tells Agenta how to process your variables. This works both when invoking prompts through Agenta as an LLM gateway and when fetching prompts programmatically via the SDK.
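For illustration, here is the kind of template Jinja2 support enables, rendered locally with the jinja2 package (the variables and filters are examples, not a fixed schema):

```python
from jinja2 import Template

prompt = Template(
    "You are a support assistant.\n"
    "{% if tier == 'enterprise' %}Prioritize SLA-related questions.{% endif %}\n"
    "Customer: {{ name | title }}\n"
    "Question: {{ question }}"
)

print(prompt.render(
    tier="enterprise",
    name="acme corp",
    question="How do I rotate my API key?",
))
```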
Every feature you need to build, test, and deploy LLM applications is now open source. This includes the evaluation system, prompt playground and management, observability, and all core workflows.
You can run evaluations using LLM-as-a-Judge, custom code evaluators, or any built-in evaluator. Create and manage test sets. Evaluate end-to-end workflows or specific spans in traces.
Experiment with prompts in the playground. Version and commit changes. Deploy to environments. Fetch configurations programmatically.
Trace your LLM applications with OpenTelemetry support. View detailed execution traces. Monitor costs and performance. Filter and search traces.
Only enterprise collaboration features stay under a separate license. This includes role-based access control (RBAC), single sign-on (SSO), and audit logs. These features support teams with specific compliance and security requirements.
You can run Agenta on your infrastructure with full access to evaluation, prompting, and observability features. You can modify the code to fit your needs. You can contribute back to the project.
The MIT license gives you freedom to use, modify, and distribute Agenta. We believe open source creates better products through community collaboration.
The Evaluation SDK lets you run evaluations programmatically from code. You get full control over test data and evaluation logic. You can evaluate agents built with any framework and view results in the Agenta dashboard.
Complex AI agents need evaluation that goes beyond UI-based testing. The Evaluation SDK provides code-level control over test data and evaluation logic. You can test agents built with any framework. Run evaluations in your CI/CD pipeline. Debug complex workflows with full trace visibility.
Create test sets directly in your code or fetch existing ones from Agenta. Test sets can include ground truth data for reference-based evaluation or work without it for evaluators that only need the output.
The SDK includes LLM-as-a-Judge, semantic similarity, and regex matching evaluators. You can also write custom Python evaluators for your specific requirements.
Evaluate your agent end to end or test specific spans in the execution trace. Test individual components like retrieval steps or tool calls separately.
Here's a minimal example evaluating a simple agent:
```python
import agenta as ag
from agenta.sdk.evaluations import aevaluate

# Initialize
ag.init()

# Define your application
@ag.application(slug="my_agent")
async def my_agent(question: str):
    # Your agent logic here
    return answer

# Define an evaluator
@ag.evaluator(slug="correctness_check")
async def correctness_check(expected: str, outputs: str):
    return {
        "score": 1.0 if outputs == expected else 0.0,
        "success": outputs == expected,
    }

# Create test data
testset = await ag.testsets.acreate(
    name="Agent Tests",
    data=[
        {"question": "What is 2+2?", "expected": "4"},
        {"question": "What is the capital of France?", "expected": "Paris"},
    ],
)

# Run evaluation
result = await aevaluate(
    name="Agent Correctness Test",
    testsets=[testset.id],
    applications=[my_agent],
    evaluators=[correctness_check],
)

print(f"View results: {result['dashboard_url']}")
```
Every evaluation run gets a shareable dashboard link. The dashboard shows full execution traces, comparison views for different versions, aggregated metrics, and individual test case details.
Online Evaluation automatically evaluates every request to your LLM application in production. Catch quality issues like hallucinations and off-brand responses as they happen.
Online Evaluation runs evaluators on your production traces automatically. Monitor quality in real time instead of discovering issues through user complaints.
Create online evaluations with filters for specific spans in your traces. Evaluate just the retrieval step in your RAG pipeline or focus on specific tool calls in your agent.
Set sampling rates to control costs. Evaluate every request during testing, then sample a percentage in production to balance quality monitoring with budget.
View all evaluated requests in one place. Filter traces by evaluation scores to find problematic cases. Jump into detailed traces to understand what went wrong.
Catch hallucinations by running fact-checking evaluators on every response. Monitor brand compliance using LLM-as-a-Judge evaluators with custom prompts. Track RAG quality by evaluating retrieval in real time. Monitor agent reliability by checking tool calls and reasoning steps. Build better test sets by capturing edge cases from production.
Enable the reasoning option to have the LLM explain its evaluation. This improves prediction quality because the model thinks through its assessment before providing a score.
When you include reasoning, the evaluator returns both the score and a detailed explanation of how it arrived at that judgment.
For complete control, provide a raw JSON schema. The evaluator will return responses that match your exact structure.
This lets you capture multiple scores, categorical labels, confidence levels, and custom fields in a single evaluation pass. You can structure the output however your workflow requires.
Once configured, your custom schemas work seamlessly in the evaluation workflow. The results display in the evaluation dashboard with all your custom fields visible.
This makes it easy to analyze multiple dimensions of quality in a single evaluation run.
Binary Score with Reasoning:
Return a simple correct/incorrect judgment along with an explanation of why the output succeeded or failed.
Multi-dimensional Feedback:
Capture separate scores for accuracy, relevance, completeness, and tone in one evaluation. Include reasoning for each dimension.
Structured Classification:
Return categorical labels (excellent/good/fair/poor) along with specific issues found and suggestions for improvement.
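As an illustrative sketch (not the exact format Agenta expects), a raw JSON schema for multi-dimensional feedback could look like this:

```json
{
  "type": "object",
  "properties": {
    "accuracy": { "type": "number", "minimum": 0, "maximum": 1 },
    "relevance": { "type": "number", "minimum": 0, "maximum": 1 },
    "tone": { "type": "string", "enum": ["excellent", "good", "fair", "poor"] },
    "reasoning": { "type": "string" }
  },
  "required": ["accuracy", "relevance", "tone"]
}
```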
We've completely rewritten and restructured our documentation with a new architecture. This is one of the largest updates we've made to the documentation, involving a near-complete rewrite of existing content and adding substantial new material.
Documentation now includes JavaScript and TypeScript examples alongside Python wherever applicable. This makes it easier for JavaScript developers to integrate Agenta into their applications.
We've added a new "Ask AI" feature that lets you ask questions directly to the documentation. Get instant answers to your questions without searching through pages.