Launch Week
See how your prompts perform across all metrics at a glance
Compare prompt versions side by side to spot regressions fast
Debug with complete traces to understand every output
Customize LLM-as-a-judge evaluators with any schema you need
Live view of the reliability of your system in production
Gain confidence that your outputs meet your quality standards
Find edge cases and add them to your test sets to improve your AI system
Clear insight into how prompt changes behave in production
Create or fetch test sets programmatically
Write custom evaluators or use the built-in ones
Evaluate end-to-end or target specific steps
View results in the dashboard
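
A minimal sketch of what that workflow can look like in Python. The function names (`run_app`, `exact_match`, `contains_keyword`) and the inline test set are illustrative placeholders, not the SDK's actual API.

```python
# Illustrative sketch only: names below are hypothetical placeholders
# for the programmatic evaluation workflow described above.

def exact_match(output: str, expected: str) -> float:
    """Custom evaluator: 1.0 if the output matches the reference exactly."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def contains_keyword(output: str, expected: str) -> float:
    """Custom evaluator: checks whether the expected keyword appears."""
    return 1.0 if expected.lower() in output.lower() else 0.0

# 1. Create (or fetch) a test set programmatically.
test_set = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 2 + 2?", "expected": "4"},
]

# 2. Run the application (end-to-end, or a single step) under evaluation.
def run_app(question: str) -> str:
    # Placeholder for the pipeline or step being evaluated.
    return "Paris" if "France" in question else "4"

# 3. Score each case with the evaluators.
results = []
for case in test_set:
    output = run_app(case["input"])
    results.append({
        "input": case["input"],
        "output": output,
        "exact_match": exact_match(output, case["expected"]),
        "contains_keyword": contains_keyword(output, case["expected"]),
    })

# 4. In practice, these scores would be pushed to the dashboard for review.
for row in results:
    print(row)
```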
All functional features now open source (MIT license)
Includes evaluation, prompt management, and observability
Development back in the public repo
Use Jinja2 in your prompt templates
Choose the templating syntax when fetching the prompt or using it through the gateway
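
As a reference for the template syntax, here is a plain Jinja2 example rendered locally with the `jinja2` library; the template text and variables are invented for illustration.

```python
# Rendering a Jinja2 prompt template locally; the template content is an
# invented example, but the syntax is standard Jinja2.
from jinja2 import Template

prompt_template = Template(
    "You are a support assistant for {{ product }}.\n"
    "{% if tone == 'formal' %}Answer formally.{% else %}Answer casually.{% endif %}\n"
    "Question: {{ question }}"
)

prompt = prompt_template.render(
    product="Acme Cloud",
    tone="formal",
    question="How do I rotate my API keys?",
)
print(prompt)
```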