Skip to main content

Minor improvements

  • Improved the logic of the Webhook evaluator
  • Made the inputs in the Human evaluation view non-editable
  • Added an option to save a test set in the Single model evaluation view
  • Included the evaluator name in the "Configure your evaluator" modal

Bug fixes

  • Fixed column resize in comparison view
  • Resolved a bug affecting the evaluation output in the CSV file
  • Corrected the path to the Evaluators view when navigating from Evaluations

Highlight ouput difference when comparing evaluations

We have improved the evaluation comparison view to show the difference to the expected output.

Improvements

  • Improved the error messages when invoking LLM applications
  • Improved "Add new evaluation" modal
  • Upgraded Sidemenu to display Configure evaluator and run evaluator under Evaluations section
  • Changed cursor to pointer when hovering over evaluation results

Deployment Versioning and RBAC

Deployment versioning

You now have access to a history of prompts deployed to our three environments. This feature allows you to roll back to previous versions if needed.

Role-Based Access Control

You can now invite team members and assign them fine-grained roles in agenta.

Improvements

  • We now prevent the deletion of test sets that are used in evaluations

Bug fixes

  • Fixed bug in custom code evaluation aggregation. Up until know the aggregated result for custom code evalution where not computed correctly.

  • Fixed bug with Evaluation results not being exported correctly

  • Updated documentation for vision gpt explain images

  • Improved Frontend test for Evaluations


Minor fixes

  • Addressed issue when invoking LLM app with missing LLM provider key
  • Updated LLM providers in Backend enum
  • Fixed bug in variant environment deployment
  • Fixed the sorting in evaluation tables
  • Made use of server timezone instead of UTC

Prompt Versioning

We've introduced the feature to version prompts, allowing you to track changes made by the team and revert to previous versions. To view the change history of the configuration, click on the sign in the playground to access all previous versions.


New JSON Evaluator

We have added a new evaluator to match JSON fields and added the possiblity to use other columns in the test set other than the correct_answer column as the ground truth.


Improved error handling in evaluation

We have improved error handling in evaluation to return more information about the exact source of the error in the evaluation view.

Improvements:

  • Added the option in A/B testing human evaluation to mark both variants as correct
  • Improved loading state in Human Evaluation

Bring your own API key

Up until know, we required users to use our OpenAI API key when using cloud. Starting now, you can use your own API key for any new application you create.


Improved human evaluation workflow

Faster human evaluation workflow

We have updated the human evaluation table view to add annotation and correct answer columns.

Improvements:

  • Simplified the database migration process
  • Fixed environment variable injection to enable cloud users to use their own keys
  • Disabled import from endpoint in cloud due to security reasons
  • Improved query lookup speed for evaluation scenarios
  • Improved error handling in playground

Bug fixes:

  • Resolved failing Backend tests
  • Fixed a bug in rate limit configuration validation
  • Fixed issue with all aggregated results
  • Resolved issue with live results in A/B testing evaluation not updating

Revamping evaluation

We've spent the past month re-engineering our evaluation workflow. Here's what's new:

Running Evaluations

  1. Simultaneous Evaluations: You can now run multiple evaluations for different app variants and evaluators concurrently.
  1. Rate Limit Parameters: Specify these during evaluations and reattempts to ensure reliable results without exceeding open AI rate limits.
  1. Reusable Evaluators: Configure evaluators such as similarity match, regex match, or AI critique and use them across multiple evaluations.

Evaluation Reports

  1. Dashboard Improvements: We've upgraded our dashboard interface to better display evaluation results. You can now filter and sort results by evaluator, test set, and outcomes.
  1. Comparative Analysis: Select multiple evaluation runs and view the results of various LLM applications side-by-side.