Roadmap
What we have shipped, what we are building now, and what we plan to build next.
Last Shipped
Programmatic Evaluation through the SDK
11/11/2025
Evaluation
Run evaluations directly from code with full control over test data and evaluation logic. Evaluate agents built with any framework and view the results in the Agenta dashboard.
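A framework-agnostic evaluation loop of this kind might look like the sketch below. The agent, test data, and scoring function are illustrative placeholders, not Agenta's actual SDK interface:

```python
# Minimal sketch of a programmatic evaluation loop: bring your own
# agent, test data, and evaluation logic. Everything here is a
# stand-in, not Agenta's real SDK API.

def my_agent(question: str) -> str:
    # Placeholder for an agent built with any framework.
    return "Paris" if "capital of France" in question else "unknown"

def exact_match(expected: str, actual: str) -> float:
    # Custom evaluation logic: 1.0 on an exact match, else 0.0.
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

test_data = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is the capital of Atlantis?", "expected": "unknown"},
]

results = [
    exact_match(case["expected"], my_agent(case["input"]))
    for case in test_data
]
average_score = sum(results) / len(results)
print(average_score)  # 1.0 for this toy agent
```

In the real SDK the loop's results would be pushed to the Agenta dashboard rather than printed.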
Online Evaluation
11/11/2025
Evaluation
Automatically evaluate every request to your LLM application in production. Catch hallucinations and off-brand responses as they happen instead of discovering them through user complaints.
Customize LLM-as-a-Judge Output Schemas
11/10/2025
Evaluation
Configure LLM-as-a-Judge evaluators with custom output schemas. Use binary, multiclass, or custom JSON formats. Enable reasoning for better evaluation quality.
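As an illustration of what such output schemas can look like, here are a binary and a multiclass variant expressed as JSON Schema, with a reasoning field included. The field names ("score", "reasoning") are examples, not a fixed format:

```python
import json

# Illustrative output schemas for an LLM-as-a-Judge evaluator.
binary_schema = {
    "type": "object",
    "properties": {
        "score": {"type": "boolean"},
        "reasoning": {"type": "string"},  # reasoning improves judge quality
    },
    "required": ["score", "reasoning"],
}

multiclass_schema = {
    "type": "object",
    "properties": {
        "score": {"enum": ["good", "borderline", "bad"]},
        "reasoning": {"type": "string"},
    },
    "required": ["score", "reasoning"],
}

# A judge configured with the binary schema returns JSON like this,
# which the caller can parse and sanity-check against the schema:
raw_response = '{"score": true, "reasoning": "The answer matches the reference."}'
verdict = json.loads(raw_response)
assert set(binary_schema["required"]) <= set(verdict)
print(verdict["score"])  # True
```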
Structured Output Support in the Playground
4/15/2025
Playground
Define and validate structured output formats in the playground. Save structured output schemas as part of your prompt configuration.
Vertex AI Provider Support
10/24/2025
Integration, Playground
Use Google Cloud's Vertex AI models including Gemini and partner models in the playground, Model Hub, and through Gateway endpoints.
Filtering Traces by Annotation
10/14/2025
Observability
Filter and search for traces based on their annotations. Find traces with low scores or feedback quickly using the rebuilt filtering system.
New Evaluation Results Dashboard
9/26/2025
Evaluation
Completely redesigned evaluation results dashboard with performance plots, side-by-side comparison, improved testcases view, focused detail view, configuration visibility, and run naming.
In Progress
Folders for Prompt Organization
Playground
Create folders and subfolders to organize prompts in the playground. Move prompts between folders and search within specific folders to structure prompt libraries.
Projects and Workspaces
Misc
Improve organization structure by adding projects. Create projects for different products and scope resources to specific projects.
Jinja2 Template Support in the Playground
Playground
Add Jinja2 template support to enable conditional logic, filters, and template blocks in prompts. The prompt type will be stored in the schema, and the SDK will handle rendering.
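The kind of prompt this enables can be sketched with Jinja2 directly; the variables and template text below are invented for illustration:

```python
from jinja2 import Template

# A prompt template using Jinja2 conditional logic and a filter,
# of the kind this feature would let you store in a prompt config.
prompt_template = Template(
    "You are a helpful assistant."
    "{% if audience == 'expert' %} Use precise technical vocabulary."
    "{% else %} Explain concepts in plain language."
    "{% endif %}"
    " The user's name is {{ name | upper }}."
)

rendered = prompt_template.render(audience="expert", name="ada")
print(rendered)
# You are a helpful assistant. Use precise technical vocabulary. The user's name is ADA.
```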
PDF Support in the Playground
Playground
Add PDF support for models that support it (OpenAI, Gemini, etc.) through base64 encoding, URLs, or file IDs. Support extends to human evaluation for reviewing model responses on PDF inputs.
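Of the three delivery mechanisms, base64 encoding is easy to sketch. The message structure below is illustrative, not any specific provider's exact request format:

```python
import base64

# Encode a PDF's bytes so it can travel inside a JSON request body.
pdf_bytes = b"%PDF-1.4 stand-in for real file contents"
encoded = base64.b64encode(pdf_bytes).decode("ascii")

# Illustrative chat message carrying the encoded file as a data URL.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize this document."},
        {"type": "file", "data": f"data:application/pdf;base64,{encoded}"},
    ],
}

# Base64 round-trips losslessly, so the model provider can recover
# the original bytes exactly.
assert base64.b64decode(encoded) == pdf_bytes
```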
Prompt Snippets
Playground
Create reusable prompt snippets that can be referenced across multiple prompts. Reference specific versions or always use the latest version to maintain consistency across prompt variants.
Date Range Filtering in Metrics Dashboard
Observability
We are adding the ability to filter traces by date range in the metrics dashboard.
Planned
AI-Powered Prompt Refinement in the Playground
Playground
Analyze prompts and suggest improvements based on best practices. Identify issues, propose refined versions, and allow users to accept, modify, or reject suggestions.
Open Observability Spans Directly in the Playground
Playground, Observability
Add a button in observability to open any chat span directly in the playground. Creates a stateless playground session pre-filled with the exact prompt, configuration, and inputs for immediate iteration.
Improving Navigation between Testsets in the Playground
Playground
We are making the playground easier to use and navigate when working with large testsets.
Appending Single Testcases in the Playground
Playground
The playground currently cannot combine testcases from different testsets. We are adding the ability to append a single testcase to a testset.
Improving Testset View
Evaluation
We are reworking the testset view to make it easier to visualize and edit testsets.
Prompt Caching in the SDK
SDK
We are adding the ability to cache prompts in the SDK.
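One way client-side prompt caching could work, sketched with standard-library tools only; fetch_prompt is a hypothetical placeholder, not the real SDK call:

```python
from functools import lru_cache

# Counter standing in for observable network traffic.
CALLS = {"count": 0}

@lru_cache(maxsize=128)
def fetch_prompt(slug: str) -> str:
    # Hypothetical stand-in for a network call that fetches a prompt
    # configuration from the registry; the real SDK call will differ.
    CALLS["count"] += 1
    return f"prompt body for {slug}"

fetch_prompt("summarizer")
fetch_prompt("summarizer")  # served from the cache, no second fetch
print(CALLS["count"])  # 1
```

A real implementation would also need an expiry or invalidation strategy so cached prompts pick up new versions, which `lru_cache` alone does not provide.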
Testset Versioning
Evaluation
We are adding the ability to version testsets. This is useful for correctly comparing evaluation results.
Tagging Traces, Testsets, Evaluations and Prompts
Evaluation
We are adding the ability to tag traces, testsets, evaluations and prompts. This is useful for organizing and filtering your data.
Support for built-in LLM Tools (e.g. web search) in the Playground
Playground
We are adding the ability to use built-in LLM tools (e.g. web search) in the playground.
Feature Requests
Upvote or comment on the features you care about or request a new feature.