Roadmap

What we shipped, what we are building next, and what we plan to build.

Last Shipped

Programmatic Evaluation through the SDK
11/11/2025
Evaluation
Run evaluations programmatically from code with full control over test data and evaluation logic. Evaluate agents built with any framework and view results in the Agenta dashboard.
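A minimal sketch of what running an evaluation from code can look like; the agent stub and the exact-match scorer below are illustrative placeholders, not the Agenta SDK's actual evaluation API, and the resulting scores would then be viewable in the Agenta dashboard.

```python
# Illustrative sketch only: `run_agent` and `exact_match` are placeholders,
# not the Agenta SDK's evaluation API.

test_cases = [
    {"inputs": {"question": "What is the capital of France?"}, "expected": "Paris"},
    {"inputs": {"question": "What is 2 + 2?"}, "expected": "4"},
]

def run_agent(inputs: dict) -> str:
    """Placeholder for an agent built with any framework."""
    return "Paris"  # stub answer so the sketch runs end to end

def exact_match(output: str, expected: str) -> float:
    """Custom evaluation logic: 1.0 if the output matches the reference, else 0.0."""
    return 1.0 if output.strip().lower() == expected.lower() else 0.0

results = []
for case in test_cases:
    output = run_agent(case["inputs"])
    results.append(
        {
            "inputs": case["inputs"],
            "output": output,
            "score": exact_match(output, case["expected"]),
        }
    )
```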
Online Evaluation
11/11/2025
Evaluation
Automatically evaluate every request to your LLM application in production. Catch hallucinations and off-brand responses as they happen instead of discovering them through user complaints.
Customize LLM-as-a-Judge Output Schemas
11/10/2025
Evaluation
Configure LLM-as-a-Judge evaluators with custom output schemas. Use binary, multiclass, or custom JSON formats. Enable reasoning for better evaluation quality.
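As an illustration, a binary verdict schema with a reasoning field could look like the following; the exact configuration fields the evaluator accepts may differ.

```python
# Example output schema for an LLM-as-a-Judge evaluator (illustrative; the
# exact configuration format in Agenta may differ). The "reasoning" field
# asks the judge to explain its verdict before scoring.
binary_judge_schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "verdict": {"type": "string", "enum": ["pass", "fail"]},
    },
    "required": ["reasoning", "verdict"],
}
```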
Structured Output Support in the Playground
4/15/2025
Playground
Define and validate structured output formats in the playground. Save structured output schemas as part of your prompt configuration.
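For example, a prompt that extracts support-ticket fields could save a schema like this alongside its configuration (the field names here are hypothetical):

```python
# Hypothetical structured output schema saved with a prompt configuration.
ticket_schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "feature_request"]},
        "priority": {"type": "integer", "minimum": 1, "maximum": 5},
        "summary": {"type": "string"},
    },
    "required": ["category", "priority", "summary"],
}
```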
Vertex AI Provider Support
10/24/2025
Integration, Playground
Use Google Cloud's Vertex AI models including Gemini and partner models in the playground, Model Hub, and through Gateway endpoints.
Filtering Traces by Annotation
10/14/2025
Observability
Filter and search for traces based on their annotations. Find traces with low scores or feedback quickly using the rebuilt filtering system.
New Evaluation Results Dashboard
9/26/2025
Evaluation
Completely redesigned evaluation results dashboard with performance plots, side-by-side comparison, improved testcases view, focused detail view, configuration visibility, and run naming.

In progress

Planned

AI-Powered Prompt Refinement in the Playground
Playground
Analyze prompts and suggest improvements based on best practices. Identify issues, propose refined versions, and allow users to accept, modify, or reject suggestions.
Open Observability Spans Directly in the Playground
Playground, Observability
Add a button in observability to open any chat span directly in the playground. Creates a stateless playground session pre-filled with the exact prompt, configuration, and inputs for immediate iteration.
Improving Navigation between Testsets in the Playground
Playground
We are making it easier to work with and navigate large testsets in the playground.
Appending Single Testcases in the Playground
Playground
Using testcases from different testsets is currently not possible in the playground. We are adding the ability to append a single testcase to a testset.
Improving Testset View
Evaluation
We are reworking the testset view to make it easier to visualize and edit testsets.
Prompt Caching in the SDK
SDK
We are adding the ability to cache prompts in the SDK.
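As a rough illustration of the idea (hypothetical helper names, not the planned API), caching keeps a fetched prompt configuration in memory instead of hitting the prompt registry on every request:

```python
import time

# Rough illustration of prompt caching (hypothetical helpers, not the planned
# SDK API): reuse a fetched prompt configuration for a TTL instead of
# re-fetching it from the registry on every request.
_cache: dict[str, tuple[float, dict]] = {}
CACHE_TTL_SECONDS = 60.0

def fetch_prompt_from_registry(slug: str) -> dict:
    """Placeholder for a network call that fetches the latest prompt config."""
    return {"slug": slug, "template": "Answer the question: {question}"}

def get_prompt(slug: str) -> dict:
    now = time.monotonic()
    cached = _cache.get(slug)
    if cached and now - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]          # cache hit: no network round trip
    config = fetch_prompt_from_registry(slug)
    _cache[slug] = (now, config)  # cache miss: fetch and store
    return config
```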
Testset Versioning
Evaluation
We are adding the ability to version testsets. This is useful for correctly comparing evaluation results.
Tagging Traces, Testsets, Evaluations and Prompts
Evaluation
We are adding the ability to tag traces, testsets, evaluations and prompts. This is useful for organizing and filtering your data.
Support for built-in LLM Tools (e.g. web search) in the Playground
Playground
We are adding the ability to use built-in LLM tools (e.g. web search) in the playground.

Feature Requests

Upvote or comment on the features you care about, or request a new feature.