Launch Week #2 Day 2: Online Evaluation

Nov 11, 2025 · 5 minutes

Has your customer support agent ever mentioned a competitor?

That would be awful.

And if you build AI apps for healthcare or finance, it could be even worse. One wrong answer can change someone’s life.

Pre-production evals can’t catch everything. You will never know exactly how users will interact with your app. You will never see all the edge cases of real-world data until you’re in production.

Today we're launching Online Evaluation. It closes the LLMOps feedback loop by evaluating your app on the traffic it actually serves.

With Online Evaluation, every AI request gets evaluated in real time. You can spot hallucinations, off-brand answers, and subtle regressions as they happen.

With Online Evaluation, you get:

  • A live view of the reliability of your system in production

  • Confidence that your outputs meet your quality standards

  • A way to capture real-world edge cases and add them to your test sets to improve your AI system

  • Clear insight into how prompt changes behave in production

How it works:

  • Pick an evaluator: use an LLM-as-a-judge or write your own evaluator logic in Python (see the sketch after this list)

  • Provide filters to target the right spans and set the sampling rate to control your cost and coverage

  • Measure changes against live traffic, spot regressions, and add them to your test set
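To make the custom-evaluator option concrete, here is a minimal sketch of what evaluator logic can look like. The function signature, the returned field names, and the COMPETITORS blocklist are illustrative assumptions, not Agenta's exact evaluator interface; check the docs for the real hook your evaluator needs to implement.

```python
# Minimal sketch of custom evaluator logic (hypothetical signature; the exact
# interface Agenta expects may differ -- see the docs).
# Idea: receive a captured span's inputs and output, return a score.

COMPETITORS = {"acme ai", "examplecorp"}  # hypothetical brand blocklist

def competitor_mention_evaluator(inputs: dict, output: str) -> dict:
    """Return 1.0 if the answer stays on-brand, 0.0 if it names a competitor."""
    text = output.lower()
    mentioned = [name for name in COMPETITORS if name in text]
    return {
        "score": 0.0 if mentioned else 1.0,
        "reason": f"mentioned: {mentioned}" if mentioned else "no competitor mentioned",
    }

if __name__ == "__main__":
    print(competitor_mention_evaluator(
        {"question": "How do you compare to Acme AI?"},
        "Acme AI is great, you should try them!",
    ))
```

Filters and the sampling rate are configured alongside the evaluator, so a check like this can run on a targeted subset of production spans (say, 10% of them) instead of every single request.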

You can set up Online Evaluation in a couple of minutes: add one line to instrument your application, then configure your evaluators with a few clicks.
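As a rough illustration of that setup step, the snippet below assumes the agenta Python SDK exposes an ag.init() call and an @ag.instrument() decorator for capturing spans. Treat these names as assumptions about the SDK surface and follow the docs for the exact one-liner for your stack.

```python
# Rough sketch of instrumenting an app so its LLM calls show up as spans that
# Online Evaluation can score. ag.init() and @ag.instrument() are assumed SDK
# calls -- check the docs for the exact instrumentation line.
import agenta as ag

ag.init()  # assumed to read the API key / host from the environment

@ag.instrument()  # assumed to capture this call as a span, with inputs and output
def answer_support_ticket(question: str) -> str:
    # ... call your LLM provider here ...
    return "Thanks for reaching out! Here's how to reset your password: ..."

print(answer_support_ticket("How do I reset my password?"))
```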

Check out our docs to get started.

Ship reliable agents faster with Agenta

Build reliable LLM apps together with integrated prompt management, evaluation, and observability.
