
Revamping evaluation

We've spent the past month re-engineering our evaluation workflow. Here's what's new:

Running Evaluations

  1. Simultaneous Evaluations: You can now run multiple evaluations concurrently across different app variants and evaluators.
  2. Rate Limit Parameters: Specify rate limits for evaluation runs and retries to get reliable results without exceeding OpenAI rate limits.
  3. Reusable Evaluators: Configure evaluators such as similarity match, regex match, or AI critique once and reuse them across multiple evaluations.
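Rate-limit handling during evaluation retries typically boils down to a retry loop with exponential backoff. Here's a minimal, standalone sketch of that pattern; it is illustrative only, not the platform's actual implementation, and the `RuntimeError` stands in for a provider rate-limit exception:

```python
import random
import time

def call_with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a callable with exponential backoff, to stay under provider rate limits."""
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:  # stand-in for a rate-limit error from the LLM provider
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Wait exponentially longer each time, with a little jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

With a parameter like this exposed per evaluation run, a batch of concurrent evaluations can degrade gracefully instead of failing outright when the provider throttles requests.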

Evaluation Reports

  1. Dashboard Improvements: We've upgraded the dashboard interface to better display evaluation results. You can now filter and sort results by evaluator, test set, and outcome.
  2. Comparative Analysis: Select multiple evaluation runs and view the results of different LLM applications side by side.

Adding Cost and Token Usage to the Playground

caution

This change requires you to pull the latest version of the Agenta platform if you're using the self-serve version.

We've added a feature that lets you track an LLM app's latency, cost, and token usage, all in one place.
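Cost tracking of this kind usually comes down to multiplying the token counts reported by the provider by per-token prices. A minimal sketch, with illustrative prices that are not the platform's actual pricing table:

```python
# Example per-1K-token prices (illustrative only; check your provider's current pricing)
PRICES = {
    "gpt-3.5-turbo": {"prompt": 0.0015, "completion": 0.002},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one LLM call from its token usage."""
    p = PRICES[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]
```

Providers such as OpenAI return the prompt and completion token counts in each API response, so a playground can surface latency, cost, and usage per call without extra instrumentation.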


Comprehensive Updates and Bug Fixes

  • Incorporated all chat turns into the chat set
  • Rectified self-hosting documentation
  • Introduced asynchronous support for applications
  • Added 'register_default' alias
  • Fixed a bug in the side-by-side feature
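The asynchronous support mentioned above amounts to exposing application entry points as coroutines, so multiple requests can be awaited concurrently instead of blocking one another. A minimal, hypothetical sketch (the function names are illustrative, not the SDK's actual API):

```python
import asyncio

async def generate(prompt: str) -> str:
    """Hypothetical async app entry point.

    The await stands in for a real LLM API call; because it is awaited,
    other requests can make progress while this one is in flight.
    """
    await asyncio.sleep(0)  # placeholder for an awaited LLM API call
    return f"echo: {prompt}"

async def main() -> list[str]:
    # Run two app calls concurrently and collect both results
    return await asyncio.gather(generate("a"), generate("b"))
```

Concurrency like this matters most for evaluation runs, where many independent app calls can be issued at once.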

Integrated File Input and UI Enhancements

  • Integrated file input feature in the SDK
  • Provided an example that includes images
  • Upgraded the human evaluation view to present larger inputs
  • Fixed issues related to data overwriting in the cloud
  • Implemented UI enhancements to the sidebar
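File inputs such as images are typically sent to LLM APIs as base64-encoded strings inside the request payload. A minimal sketch of the encoding step (the helper name is illustrative; the SDK's actual interface may differ):

```python
import base64

def encode_image(data: bytes) -> str:
    """Encode raw image bytes as a base64 string for an LLM request payload.

    Illustrative helper only; most provider APIs expect the result wrapped
    in a data URL or a structured content part.
    """
    return base64.b64encode(data).decode("ascii")
```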

Multiple UI and CSV Reader Fixes

  • Fixed a bug impacting the CSV reader
  • Addressed an issue of variant overwriting
  • Made tabs draggable for better UI navigation
  • Implemented support for multiple LLM keys in the UI