Minor fixes

  • Addressed issue when invoking LLM app with missing LLM provider key
  • Updated LLM providers in Backend enum
  • Fixed bug in variant environment deployment
  • Fixed the sorting in evaluation tables
  • Made use of server timezone instead of UTC

Prompt Versioning

We've introduced prompt versioning, allowing you to track changes made by the team and revert to previous versions. To view a configuration's change history, click the icon in the playground to access all previous versions.


New JSON Evaluator

We have added a new evaluator that matches JSON fields, and added the possibility to use test set columns other than correct_answer as the ground truth.
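
For illustration, here is a minimal sketch of the kind of field-level comparison such an evaluator performs; the function name, field names, and sample values are hypothetical, not the evaluator's actual implementation.

```python
import json

def json_field_match(app_output: str, ground_truth: str, field: str) -> bool:
    """Return True when `field` in the app's JSON output equals `field`
    in the ground-truth JSON taken from the chosen test set column."""
    try:
        output_value = json.loads(app_output).get(field)
        expected_value = json.loads(ground_truth).get(field)
    except (json.JSONDecodeError, AttributeError):
        return False
    return output_value == expected_value

# Ground truth drawn from a custom test set column rather than correct_answer.
print(json_field_match('{"city": "Paris", "country": "FR"}', '{"city": "Paris"}', "city"))  # True
```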


Improved error handling in evaluation

Error handling in evaluation now returns more information about the exact source of the error in the evaluation view.

Improvements:

  • Added the option in A/B testing human evaluation to mark both variants as correct
  • Improved loading state in Human Evaluation

Bring your own API key

Up until now, we required users to use our OpenAI API key on cloud. Starting now, you can use your own API key for any new application you create.


Improved human evaluation workflow

Faster human evaluation workflow

We have updated the human evaluation table view to add annotation and correct answer columns.

Improvements:

  • Simplified the database migration process
  • Fixed environment variable injection to enable cloud users to use their own keys
  • Disabled import from endpoint in cloud for security reasons
  • Improved query lookup speed for evaluation scenarios
  • Improved error handling in playground

Bug fixes:

  • Resolved failing Backend tests
  • Fixed a bug in rate limit configuration validation
  • Fixed issue with all aggregated results
  • Resolved issue with live results in A/B testing evaluation not updating

Revamping evaluation

We've spent the past month re-engineering our evaluation workflow. Here's what's new:

Running Evaluations

  1. Simultaneous Evaluations: You can now run multiple evaluations for different app variants and evaluators concurrently.
  2. Rate Limit Parameters: Specify these during evaluations and reattempts to ensure reliable results without exceeding OpenAI rate limits (see the sketch after this list).
  3. Reusable Evaluators: Configure evaluators such as similarity match, regex match, or AI critique and use them across multiple evaluations.
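
To illustrate what a rate-limit parameter typically controls, here is a minimal retry-with-backoff sketch; the function and parameter names are assumptions for illustration, not Agenta's actual configuration keys.

```python
import random
import time

def call_with_retries(call, max_retries: int = 3, base_delay: float = 1.0):
    """Retry a callable with exponential backoff plus jitter when it
    raises a rate-limit error, instead of failing the evaluation run."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RuntimeError:  # stand-in for a provider rate-limit error
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```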

Evaluation Reports

  1. Dashboard Improvements: We've upgraded our dashboard interface to better display evaluation results. You can now filter and sort results by evaluator, test set, and outcomes.
  2. Comparative Analysis: Select multiple evaluation runs and view the results of various LLM applications side-by-side.

Adding Cost and Token Usage to the Playground

caution

This change requires you to pull the latest version of the agenta platform if you're using the self-serve version.

We've added a feature that allows you to compare the time taken by an LLM app, its cost, and track token usage, all in one place.
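
As a rough illustration of how cost can be derived from token usage, here is a small sketch; the per-token prices are placeholder assumptions, so check your provider's current pricing.

```python
# Placeholder per-1K-token prices; substitute your model's actual pricing.
PRICES_PER_1K = {"prompt": 0.0015, "completion": 0.002}

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one LLM call from its token counts."""
    return (
        prompt_tokens / 1000 * PRICES_PER_1K["prompt"]
        + completion_tokens / 1000 * PRICES_PER_1K["completion"]
    )

print(f"${estimate_cost(420, 150):.4f}")  # 0.00063 + 0.0003 -> $0.0009
```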
