Evaluators can access all columns

Evaluators can now access all columns in the test set. Previously, you were limited to using the correct_answer column as the ground truth / reference answer in evaluation. Now you can configure your evaluator to use any column in the test set as the ground truth. To do that, open the collapsible Advanced Settings when configuring the evaluator and set the Expected Answer Column to the name of the column containing the reference answer you want to use.

In addition to this:

  • We've upgraded the SDK to pydantic v2.
  • We've improved the speed of the get config endpoint by 10x.
  • We've added documentation for observability.

Playground Improvements

v0.14.1-13

  • We've improved the workflow for adding outputs to a dataset in the playground. In the past, you had to select the name of the test set each time. Now, the last used test set is selected by default.
  • We have significantly improved the debugging experience when creating applications from code. Now, if an application fails, you can view the logs to understand the reason behind the failure.
  • We moved the copy message button in the playground to the output text area.
  • We now hide the cost and usage in the playground when they aren't specified.
  • We've made improvements to error messages in the playground.

Bug Fixes

  • Fixed the order of the arguments when running a custom code evaluator (see the sketch after this list)
  • Fixed the timestamp in the Testset view (previously, timestamps were dropping the trailing zero)
  • Fixed the creation of applications from code in the self-hosted version when using Windows
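
For context, a custom code evaluator is a Python function that receives the application's configuration, the test-case inputs, the generated output, and the reference answer, and returns a score. The sketch below is illustrative only; the exact parameter names and order depend on your Agenta version:

def evaluate(app_params, inputs, output, correct_answer):
    # app_params: the variant's configuration (e.g. prompt template, model settings)
    # inputs: the test-case inputs passed to the application
    # output: the text the application produced
    # correct_answer: the reference answer from the test set
    # Hypothetical scoring: exact match returns 1.0, anything else 0.0
    return 1.0 if output.strip() == correct_answer.strip() else 0.0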

Prompt and Configuration Registry

We've introduced a feature that allows you to use Agenta as a prompt registry or prompt management system. In the deployment view, we now provide an endpoint to directly fetch the latest version of your prompt. Here is how it looks:


from agenta import Agenta
agenta = Agenta()
config = agenta.get_config(base_id="xxxxx", environment="production", cache_timeout=200) # Fetches the configuration with caching
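
The returned configuration can then be used to drive your application. Below is a minimal sketch, assuming the config behaves like a dict of the variant's parameters; the prompt_template key is illustrative, not a guaranteed field:

# Hypothetical usage of the fetched configuration
prompt_template = config.get("prompt_template", "")
prompt = prompt_template.format(question="What is the capital of France?")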

You can find additional documentation here.

Improvements

  • Previously, publishing a variant from the playground to an environment was a manual process. From now on, variants are published to the production environment by default.

Miscellaneous Improvements

  • The total cost of an evaluation is now displayed in the evaluation table. This allows you to understand how much evaluations are costing you and track your expenses.

Bug Fixes

  • Fixed sidebar focus in automatic evaluation results view
  • Fixed the incorrect URLs shown when running agenta variant serve

Evaluation Speed Increase and Numerous Quality of Life Improvements

v0.13.1-5

  • We've improved the speed of evaluations by 3x through asynchronous batching of calls (see the sketch after this list).
  • We've added Groq as a new provider along with Llama3 to our playground.
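
As a rough illustration of the approach (not Agenta's actual implementation), asynchronous batching runs a bounded number of evaluation calls concurrently instead of one at a time:

import asyncio

async def evaluate_testcase(testcase):
    # Placeholder for a single evaluation call (e.g. an LLM request plus scoring)
    await asyncio.sleep(0.1)
    return {"input": testcase, "score": 1.0}

async def run_evaluation(testcases, batch_size=10):
    # Cap concurrency so at most batch_size calls are in flight at once
    semaphore = asyncio.Semaphore(batch_size)

    async def bounded(testcase):
        async with semaphore:
            return await evaluate_testcase(testcase)

    # Dispatch all test cases concurrently and collect the results
    return await asyncio.gather(*(bounded(tc) for tc in testcases))

results = asyncio.run(run_evaluation(list(range(100))))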

Bug Fixes

  • Resolved a UI rendering bug in the Testset view.
  • Fixed incorrect URLs displayed when running the 'agenta variant serve' command.
  • Corrected timestamps in the configuration.
  • Resolved errors when using the chat template with empty input.
  • Fixed latency format in evaluation view.
  • Added a spinner to the Human Evaluation results table.
  • Resolved an issue where the .gitignore file was being overwritten when running 'agenta init'.

Observability (beta)

You can now monitor your application usage in production. We've added a new observability feature (currently in beta), which allows you to:

  • Monitor cost, latency, and the number of calls to your applications in real-time.
  • View the logs of your LLM calls, including inputs, outputs, and the configurations used. You can also add any interesting logs to your test set.
  • Trace your more complex LLM applications to understand and debug their internal logic (see the conceptual sketch below).
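
Conceptually, each monitored call produces a record along these lines (an illustrative sketch, not Agenta's actual data model):

from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class LLMCallRecord:
    inputs: Dict[str, Any]   # inputs sent to the application
    output: str              # generated output
    config: Dict[str, Any]   # configuration used for the call
    cost: float              # estimated cost of the call in USD
    latency_ms: float        # end-to-end latency in milliseconds

record = LLMCallRecord(
    inputs={"question": "What is Agenta?"},
    output="Agenta is an open-source LLMOps platform.",
    config={"model": "gpt-3.5-turbo", "temperature": 0.7},
    cost=0.0004,
    latency_ms=850.0,
)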

As of now, all newly created applications include observability by default. We are working towards a GA version in the coming weeks, which will be scalable and better integrated with your applications. We will also be adding tutorials and documentation for it.

Find examples of LLM apps created from code with observability here.


Minor improvements

Toggle variants in comparison view

You can now toggle the visibility of variants in the comparison view, allowing you to compare many variants side by side at the same time.

Improvements

  • You can now add a datapoint from the playground to the test set even if there is a column mismatch

Bug fixes

  • Resolved issue with "Start Evaluation" button in Testset view
  • Fixed a bug in the CLI that prevented variants from being served

New evaluators

We've added more evaluators: a new string-matching evaluator and a Levenshtein distance evaluator.
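
For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Here is a sketch of the classic dynamic-programming computation (not necessarily Agenta's exact implementation):

def levenshtein_distance(a: str, b: str) -> int:
    # Row of edit distances for the empty prefix of a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[len(b)]

# Example: comparing an output against a reference answer
print(levenshtein_distance("kitten", "sitting"))  # 3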

Improvements

  • Updated documentation for human evaluation
  • Made improvements to the human evaluation card view
  • Added a dialog in the UI to indicate that the test set is being saved

Bug fixes

  • Fixed issue with viewing the full output value during evaluation
  • Enhanced error boundary logic to unblock user interface
  • Improved logic to save and retrieve multiple LLM provider keys
  • Fixed Modal instances to support dark mode