
Speed Improvements in the Playground

We rewrote most of Agenta's frontend. You'll notice significantly faster load times when creating prompts and working in the playground.

We also made many improvements and fixed bugs:

  • LLM-as-a-judge now uses double curly braces {{}} instead of single curly braces {}, matching the templating syntax of regular prompts. Old LLM-as-a-judge prompts with single curly braces continue to work. We also updated the LLM-as-a-judge playground to make editing prompts easier.
  • You can now use an external Redis instance for caching by configuring it through an environment variable.
  • Fixed the custom workflow quick start tutorial and examples
  • Fixed SDK compatibility issues with Python 3.9
  • Fixed default filtering in the observability dashboard
  • Fixed error handling in the evaluator playground
  • Fixed the Tracing SDK to allow instrumenting streaming responses and overriding OTEL environment variables (see the sketch below)
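
For the last item, here is a minimal sketch of what this can look like, assuming the Agenta Python SDK (ag.init and the ag.instrument decorator), an AGENTA_API_KEY already set in the environment, and the standard OTEL_EXPORTER_OTLP_ENDPOINT variable; adjust names and values to your setup.

```python
import os
import agenta as ag

# Assumption: the SDK picks up standard OpenTelemetry variables such as
# OTEL_EXPORTER_OTLP_ENDPOINT; the value below is purely illustrative.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://otel-collector.example.com"

ag.init()  # expects AGENTA_API_KEY (and AGENTA_HOST for self-hosted) in the environment


@ag.instrument()  # streamed chunks from the generator are captured as part of the trace
def stream_completion(question: str):
    # Stand-in for a real streaming LLM call
    for chunk in ["The ", "answer ", "is ", "42."]:
        yield chunk


print("".join(stream_completion("What is the answer?")))
```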

Multiple Metrics in Human Evaluation

We spent the past months rethinking how evaluation should work. Today we're announcing one of the first big improvements.

The fastest teams building LLM apps were already using human evaluation to check their outputs before going live, and Agenta helped them do this in minutes.

But this workflow was limited: you could only score outputs against a single metric.

That's why we rebuilt the human evaluation workflow.

Now you can configure multiple evaluators and metrics and use them to score outputs. This lets you evaluate the same output on different dimensions, such as relevance or completeness. Scores can be binary or numerical, and you can even use string fields for comments or the expected answer.

This unlocks a whole new set of use cases:

  • Compare your prompts on multiple metrics and understand where you can improve.
  • Turn your annotations into test sets and use them in prompt engineering. For instance, you can add comments that help you improve your prompts later.
  • Use human evaluation to bootstrap automatic evaluation. You can annotate your outputs with the expected answer or a rubric, then use them to set up an automatic evaluation.

Watch the video below and read the post for more details. Or check out the docs to learn how to use the new human evaluation workflow.


Major Playground Improvements and Enhancements

We've made lots of improvements to the playground. Here are some of the highlights:

JSON Editor Improvements

Enhanced Error Display and Editing

The JSON editor now provides clearer error messages and improved editing functionality. We've fixed issues with error display that previously made it difficult to debug JSON configuration problems.

Undo Support with Ctrl+Z

You can now use Ctrl+Z (or Cmd+Z on Mac) to undo changes in the JSON editor, making it much easier to iterate on complex JSON configurations without fear of losing your work.

Bug Fix: JSON Field Order Preservation

The structured output JSON field order is now preserved throughout the system. This is crucial when working with LLMs that are sensitive to the ordering of JSON fields in their responses.

Previously, JSON objects might have their field order changed during processing, which could affect LLM behavior and evaluation consistency. Now, the exact order you define is maintained across all operations.

Playground Improvements

Dynamic Variables

We've improved how the editor handles dynamic variables in prompts.

Markdown and Text View Toggle

You can now switch between markdown and text view for messages.

Collapsible Interface Elements

We've added the ability to collapse various sections of the playground interface, helping you focus on what matters most for your current task.

Collapsible Test Cases for Large Sets

When loading large test sets, you can now collapse individual test cases to better manage the interface.

Visual Diff When Committing Changes

The playground now shows a visual diff when you're committing changes, making it easy to review exactly what modifications you're about to save.


Support for Images in the Playground

Agenta now supports images in the playground, test sets, and evaluations. This enables a systematic workflow for developing and testing applications that use vision models.

New Features:

  • Image Support in Playground: Add images directly to your prompts when experimenting in the playground.
  • Multi-modal Test Sets: Create and manage test sets that include image inputs alongside text (see the sketch after this list).
  • Image-based Evaluations: Run evaluations on prompts designed to process images, allowing for systematic comparison of different prompt versions or models.
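
As a rough illustration, a multi-modal test case could look like the following. This sketch assumes an OpenAI-style chat message format with image_url content parts; the exact column names and structure in your test sets may differ.

```python
# Hypothetical test case combining a text instruction and an image input
# (OpenAI-style content parts).
test_case = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the defect visible in this photo."},
                {"type": "image_url", "image_url": {"url": "https://example.com/part-42.jpg"}},
            ],
        }
    ],
    "correct_answer": "Hairline crack along the left weld seam.",
}
```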

LlamaIndex Integration

We're excited to announce observability support for LlamaIndex applications.

If you're using LlamaIndex, you can now see detailed traces in Agenta to debug your application.

The integration uses auto-instrumentation - just add one line of code and all your LlamaIndex operations are traced.

This helps when you need to understand what's happening inside your RAG pipeline, track performance bottlenecks, or debug issues in production.
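
A minimal sketch of the setup, assuming the Agenta SDK together with the OpenTelemetry LlamaIndex instrumentor from the opentelemetry-instrumentation-llamaindex package; the tutorial linked below has the exact packages and versions to use.

```python
import agenta as ag
from opentelemetry.instrumentation.llamaindex import LlamaIndexInstrumentor

ag.init()  # expects AGENTA_API_KEY (and AGENTA_HOST for self-hosted) in the environment

# The one extra line: from here on, LlamaIndex operations (retrieval, synthesis,
# LLM calls) are exported as traces to Agenta.
LlamaIndexInstrumentor().instrument()

# ... build and query your LlamaIndex application as usual ...
```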

Check out the tutorial and the Jupyter notebook for more details.

Annotate Your LLM Response (preview)

One of our most requested features was the ability to attach user feedback and annotations (e.g. scores) to LLM responses traced in Agenta.

Today we're previewing the first of a family of features around this topic.

As of today you can use the annotation API to add annotations to LLM responses traced in Agenta.

This is useful to:

  • Collect user feedback on LLM responses
  • Run custom evaluation workflows
  • Measure application performance in real-time
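
To give a feel for the flow, here is an illustrative sketch that posts an annotation for a traced span over HTTP. The endpoint path and payload fields are placeholders rather than the authoritative API shape; the how-to guide linked below has the real details.

```python
import os
import requests

# Placeholder endpoint and payload: consult the annotation API guide for the
# actual URL and field names.
resp = requests.post(
    "https://cloud.agenta.ai/api/preview/annotations/",
    headers={"Authorization": f"ApiKey {os.environ['AGENTA_API_KEY']}"},
    json={
        "annotation": {
            "data": {"outputs": {"score": 4, "comment": "Correct, but a bit verbose."}},
            "links": {"invocation": {"trace_id": "<trace-id>", "span_id": "<span-id>"}},
        },
    },
)
resp.raise_for_status()
```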

Check out the guide on how to annotate traces from the API for more details, or try our new tutorial, available as a Jupyter notebook.

Other stuff:

  • We've cut the migration process from about an hour down to a couple of minutes.

Tool Support in the Playground

We released tool usage in the Agenta playground - a key feature for anyone building agents with LLMs.

Agents need tools to access external data, perform calculations, or call APIs.

Now you can:

  • Define tools directly in the playground using JSON Schema (see the example below)
  • Test how your prompt generates tool calls in real-time
  • Preview how your agent handles tool responses
  • Verify tool call correctness with custom evaluators

The tool schema is saved with your prompt configuration, making integration easy when you fetch configs through the API.
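
For reference, a tool definition might look like the snippet below, written here as a Python dict in the common OpenAI function-calling shape; treat the wrapper fields as an assumption and adapt them to what the playground expects.

```python
# A weather-lookup tool described with JSON Schema (OpenAI-style function tool).
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Paris"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
```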


Documentation Overhaul, New Models, and Platform Improvements

We've made significant improvements across Agenta with a major documentation overhaul, new model support, self-hosting enhancements, and UI improvements.

Revamped Prompt Engineering Documentation:

We've completely rewritten our prompt management and prompt engineering documentation.

Start exploring the new documentation in our updated Quick Start Guide.

New Model Support:

Our platform now supports several new LLM models:

  • Google's Gemini 2.5 Pro and Flash
  • Alibaba Cloud's Qwen 3
  • OpenAI's GPT-4.1

These models are available in both the playground and through the API.

Playground Enhancements:

We've added a draft state to the playground, providing a better editing experience. Changes are now clearly marked as drafts until committed.

Self-Hosting Improvements:

We've significantly simplified the self-hosting experience by changing how environment variables are handled in the frontend:

  • No more rebuilding images to change ports or domains
  • Dynamic configuration through environment variables at runtime

Check out our updated self-hosting documentation for details.

Bug Fixes and Optimizations:

  • Fixed OpenTelemetry integration edge cases
  • Resolved edge cases in the API that affected certain workflow configurations
  • Improved UI responsiveness and fixed minor visual inconsistencies
  • Added chat support in cloud

We are SOC 2 Type 2 Certified

We are SOC 2 Type 2 certified. An independent third party has audited our platform and verified that it meets the SOC 2 criteria for security and compliance.


Structured Output Support in the Playground

The playground now supports structured outputs. You can define the expected output format and validate the model's output against it.

With Agenta's playground, implementing structured outputs is straightforward:

  • Open any prompt

  • Switch the Response format dropdown from text to JSON mode or JSON Schema

  • Paste or write your schema (Agenta supports the full JSON Schema specification) - see the example below

  • Run the prompt - the response panel will show the response pretty-printed

  • Commit the changes - the schema will be saved with your prompt, so when your SDK fetches the prompt, it will include the schema information
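
As an example, a schema for JSON Schema mode could look like the snippet below (written as a Python dict; the field names and the name/strict wrapper follow the common OpenAI response-format convention and are illustrative):

```python
# Illustrative JSON Schema for a structured extraction task.
ticket_schema = {
    "name": "support_ticket",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["billing", "bug", "feature_request"]},
            "priority": {"type": "integer", "minimum": 1, "maximum": 5},
            "summary": {"type": "string"},
        },
        "required": ["category", "priority", "summary"],
        "additionalProperties": False,
    },
}
```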

Check out the blog post for more details: https://agenta.ai/blog/structured-outputs-playground