RAGAS Evaluators and Traces in the Playground

We're excited to announce two major features this week:

  1. We've integrated RAGAS evaluators into agenta. Two new evaluators have been added: RAG Faithfulness (measuring how consistent the LLM output is with the context) and Context Relevancy (assessing how relevant the retrieved context is to the question). Both evaluators use intermediate outputs within the trace to calculate the final score.

    Check out the tutorial to learn how to use RAG evaluators (an illustrative sketch also follows the note below).

  2. You can now view traces directly in the playground. This feature enables you to debug your application while configuring it, for example by examining the prompts sent to the LLM or reviewing intermediate outputs.

note

Both features are available exclusively in the cloud and enterprise versions of agenta.
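
Under the hood, these evaluators are based on the RAGAS metrics of the same names, which use an LLM as a judge over the question, the retrieved contexts, and the generated answer captured in the trace. The sketch below shows how the same scores could be computed with the ragas package directly. It is illustrative only: it assumes a ragas 0.1-era API (imports and metric names vary across versions, and newer releases rename context_relevancy), and it requires an OPENAI_API_KEY for the judge model.

# Illustrative only: ragas 0.1-era API; requires `pip install ragas datasets`
# and an OPENAI_API_KEY for the judge LLM.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_relevancy

samples = Dataset.from_dict({
    "question": ["What is agenta?"],
    "contexts": [["agenta is an open-source LLMOps platform."]],
    "answer": ["agenta is an open-source platform for building LLM apps."],
})

# Faithfulness: is the answer consistent with the retrieved contexts?
# Context relevancy: are the retrieved contexts relevant to the question?
scores = evaluate(samples, metrics=[faithfulness, context_relevancy])
print(scores)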


Migration from MongoDB to Postgres

We have migrated the Agenta database from MongoDB to Postgres. As a result, the platform is much faster (up to 10x in some use cases).

However, if you are self-hosting agenta, note that this is a breaking change that requires you to manually migrate your data from MongoDB to Postgres.

If you are using the cloud version of Agenta, there is nothing you need to do (other than enjoying the new performance improvements).


More Reliable Evaluations

We have worked extensively on improving the reliability of evaluations. Specifically:

  • We improved the statuses for evaluations and added a new Queued status.
  • We improved the error handling in evaluations. Now we show the exact error message that caused the evaluation to fail.
  • We fixed issues that caused evaluations to run indefinitely.
  • We fixed issues in the calculation of scores in human evaluations.
  • We fixed small UI issues with large outputs in human evaluations.
  • We have added a new export button in the evaluation view to export the results as a CSV file.

In observability:

  • We have added a new integration with LiteLLM to automatically trace all LLM calls made through it.
  • We now automatically propagate cost and token usage from spans to traces (see the sketch below).
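
To illustrate the kind of per-call data involved (cost and token counts), the sketch below uses LiteLLM's public custom-callback hook to log cost and usage after each call. It is not agenta's integration, just a minimal example of reading this data from a LiteLLM response; track_cost_and_tokens is a hypothetical name, and with the agenta integration enabled this bookkeeping happens automatically.

import litellm
from litellm import completion

def track_cost_and_tokens(kwargs, completion_response, start_time, end_time):
    # LiteLLM computes the call's cost from its model pricing table
    cost = litellm.completion_cost(completion_response=completion_response)
    usage = completion_response.usage
    print(f"cost=${cost:.6f}, total_tokens={usage.total_tokens}")

# Register the callback so it runs after every successful completion
litellm.success_callback = [track_cost_and_tokens]

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)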

Evaluators Can Access All Columns

Evaluators can now access all columns in the test set. Previously, you were limited to using the correct_answer column as the ground truth / reference answer in evaluations. Now you can configure your evaluator to use any column in the test set as the ground truth. To do that, open the collapsible Advanced Settings when configuring the evaluator and set the Expected Answer Column to the name of the column containing the reference answer you want to use (see the sketch below).
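
For example, you could build and upload a test set like the one sketched below (the file and column names here are arbitrary examples) and then point the Expected Answer Column at reference_answer instead of correct_answer.

# Illustrative: a test set whose reference answers live in a custom column.
import csv

rows = [
    {"question": "What is the capital of France?", "reference_answer": "Paris"},
    {"question": "What is 2 + 2?", "reference_answer": "4"},
]

with open("my_testset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "reference_answer"])
    writer.writeheader()
    writer.writerows(rows)

After uploading this CSV as a test set, open the evaluator's Advanced Settings and set the Expected Answer Column to reference_answer.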

In addition to this:

  • We've upgraded the SDK to pydantic v2.
  • We have improved the speed of the get config endpoint by 10x.
  • We have added documentation for observability.

Playground Improvements

v0.14.1-13

  • We've improved the workflow for adding outputs to a dataset in the playground. In the past, you had to select the name of the test set each time. Now, the last used test set is selected by default.
  • We have significantly improved the debugging experience when creating applications from code. Now, if an application fails, you can view the logs to understand the reason behind the failure.
  • We moved the copy message button in the playground to the output text area.
  • We now hide the cost and usage in the playground when they aren't specified
  • We've made improvements to error messages in the playground

Bug Fixes

  • Fixed the order of the arguments when running a custom code evaluator
  • Fixed the timestamps in the Testset view (previously, timestamps were dropping the trailing zero)
  • Fixed the creation of application from code in the self-hosted version when using Windows

Prompt and Configuration Registry

We've introduced a feature that allows you to use Agenta as a prompt registry or prompt management system. In the deployment view, we now provide an endpoint to directly fetch the latest version of your prompt. Here is what it looks like:


from agenta import Agenta

agenta = Agenta()
# Fetch the configuration deployed to the production environment,
# caching the result for 200 seconds
config = agenta.get_config(base_id="xxxxx", environment="production", cache_timeout=200)

You can find additional documentation here.

Improvements

  • Previously, publishing a variant from the playground to an environment was a manual process. From now on, variants are published to the production environment by default.

Miscellaneous Improvements

  • The total cost of an evaluation is now displayed in the evaluation table. This allows you to understand how much evaluations are costing you and track your expenses.

Bug Fixes

  • Fixed sidebar focus in automatic evaluation results view
  • Fixed the incorrect URLs shown when running agenta variant serve

Evaluation Speed Increase and Numerous Quality of Life Improvements

v0.13.1-5

  • We've improved the speed of evaluations by 3x through the use of asynchronous batching of calls (see the sketch after this list).
  • We've added Groq as a new provider along with Llama3 to our playground.
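
For readers curious about the technique, the sketch below shows the general pattern of bounded asynchronous batching: all calls are scheduled concurrently, with a semaphore capping how many are in flight at once. It is an illustration of the approach, not agenta's actual evaluation code; call_llm stands in for whatever async function performs a single call.

import asyncio

async def evaluate_all(rows, call_llm, max_concurrency=10):
    # Cap the number of in-flight calls so provider rate limits are respected
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(row):
        async with semaphore:
            return await call_llm(row)

    # Schedule every row at once; the semaphore throttles actual execution
    return await asyncio.gather(*(run_one(row) for row in rows))

# Hypothetical usage: results = asyncio.run(evaluate_all(testset_rows, call_llm))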

Bug Fixes

  • Resolved a rendering UI bug in Testset view.
  • Fixed incorrect URLs displayed when running the 'agenta variant serve' command.
  • Corrected timestamps in the configuration.
  • Resolved errors when using the chat template with empty input.
  • Fixed latency format in evaluation view.
  • Added a spinner to the Human Evaluation results table.
  • Resolved an issue where the gitignore was being overwritten when running 'agenta init'.

Observability (beta)

You can now monitor your application usage in production. We've added a new observability feature (currently in beta), which allows you to:

  • Monitor cost, latency, and the number of calls to your applications in real-time.
  • View the logs of your LLM calls, including inputs, outputs, and the configurations used. You can also add any interesting logs to your test set.
  • Trace your more complex LLM applications to understand their internal logic and debug them.

As of now, all new applications created will include observability by default. We are working towards a GA version in the coming weeks, which will be scalable and better integrated with your applications. We will also be adding tutorials and documentation about it.

Find examples of LLM apps created from code with observability here.